RE: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi



Thanks Mathieu,
So I must have either a shared filesystem or Hadoop as the filesystem in order to write data from a Standalone mode cluster setup. Thanks for your input.


Regards
Stuti Awasthi



From: Mathieu Longtin [math...@closetwork.org]
Sent: Tuesday, May 24, 2016 7:34 PM
To: Stuti Awasthi; Jacek Laskowski
Cc: user
Subject: Re: Not able to write output to local filsystem from Standalone mode.




In standalone mode, executors assume they have access to a shared file system. The driver creates the directory and the executors write the files, so the executors end up not writing anything since the directory does not exist locally on them.
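
For anyone hitting the same wall: one workaround sketch (not from this thread) is to
collect the result to the driver and write it there with plain Java I/O, so no shared
or distributed filesystem is needed. This assumes the result fits in driver memory;
the path is illustrative and "data" is the RDD from the original post.

// Sketch only: bring the (small) result back to the driver and write it locally.
import java.io.PrintWriter

val lines = data.map(_.toString).collect()
val writer = new PrintWriter("/home/stuti/test1.txt")
try lines.foreach(writer.println) finally writer.close()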


On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi  wrote:



hi Jacek,


The parent directory is already present; it's my home directory. I'm using a 64-bit Linux (Red Hat) machine.
Also, I noticed that the "test1" folder is created on my master with a subdirectory "_temporary" which is empty, but on the slaves no such directory is created under /home/stuti.


Thanks
Stuti 


From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filsystem from Standalone mode.

Hi, 
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?

Jacek
On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:



Hi All,
I have a 3-node Spark 1.6 Standalone mode cluster with 1 Master and 2 Slaves. Also, I am not using Hadoop as the filesystem. Now, I am able to launch the shell, read the input file from the local filesystem and perform transformations successfully. When I try to write my output to a local filesystem path, I receive the error below.
 
I tried to search on the web and found a similar Jira: 
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows as resolved for Spark 1.3+, people have posted that the same issue still persists in the latest versions.
 
ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 
What is the best way to resolve this issue if I do not want to have Hadoop installed, or is it mandatory to have Hadoop in order to write the output from Standalone cluster mode?
 
Please suggest.
 
Thanks 
Stuti Awasthi
 





Using Java in Spark shell

2016-05-24 Thread Ashok Kumar
Hello,
A newbie question.
Is it possible to use Java code directly in the Spark shell without using Maven to
build a jar file?
How can I switch from Scala to Java in the Spark shell?
Thanks
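
For what it's worth (this is not an answer from the thread): Spark 1.x does not ship
a Java REPL, but the Scala spark-shell can call Java classes directly through Java
interop, so small pieces of Java-style code can be tried without building a jar.
A rough sketch, typed into spark-shell:

// Java classes are usable directly from the Scala spark-shell.
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._

val javaList = new JArrayList[Integer]()
javaList.add(1); javaList.add(2); javaList.add(3)

// Hand the Java collection to Spark through the usual Scala API.
val rdd = sc.parallelize(javaList.asScala.map(_.intValue))
println(rdd.sum())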



job build cost more and more time

2016-05-24 Thread naliazheli
I am using Spark 1.6 and noticed that the time between jobs gets longer and longer;
sometimes it can be 20 minutes.
I tried to search for similar questions and found a close one:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-gets-slower-as-it-gets-executed-more-times-td1089.html#a1146

and found something useful:
One thing to worry about is long-running jobs or shells. Currently, state
buildup of a single job in Spark is a problem, as certain state such as
shuffle files and RDD metadata is not cleaned up until the job (or shell)
exits. We have hacky ways to reduce this, and are working on a long term
solution. However, separate, consecutive jobs should be independent in terms
of performance.


On Sat, Feb 1, 2014 at 8:27 PM, 尹绪森 <[hidden email]> wrote:
Is your spark app an iterative one? If so, your app is creating a big DAG
in every iteration. You should checkpoint it periodically, say, one
checkpoint every 10 iterations.

I also wrote a test program; here is the code:

public static void newJob(int jobNum,SQLContext sqlContext){
for(int i=0;i
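
The checkpointing advice quoted above can be sketched roughly as follows. This is an
illustrative Scala sketch, not the poster's actual test program; the checkpoint
directory is an assumption and must be reachable by all executors.

// "Checkpoint every N iterations" so the DAG/lineage does not keep growing.
sc.setCheckpointDir("/tmp/spark-checkpoints")  // assumed path

var rdd = sc.parallelize(1 to 1000000).map(_.toLong)
for (i <- 1 to 100) {
  rdd = rdd.map(_ + 1)        // each iteration extends the lineage
  if (i % 10 == 0) {
    rdd.cache()
    rdd.checkpoint()          // truncates the lineage once materialized
    rdd.count()               // force materialization so the checkpoint happens now
  }
}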

Re: Dataset Set Operations

2016-05-24 Thread Michael Armbrust
What is the schema of the case class?

On Tue, May 24, 2016 at 3:46 PM, Tim Gautier  wrote:

> Hello All,
>
> I've been trying to subtract one dataset from another. Both datasets
> contain case classes of the same type. When I subtract B from A, I end up
> with a copy of A that still has the records of B in it. (An intersection of
> A and B always results in 0 results.) All I can figure is that spark is
> doing an equality check that determines nothing matches. What is that
> equality function and is there some way I can change it?
>
> Thanks
> Tim
>


Dataset Set Operations

2016-05-24 Thread Tim Gautier
Hello All,

I've been trying to subtract one dataset from another. Both datasets
contain case classes of the same type. When I subtract B from A, I end up
with a copy of A that still has the records of B in it. (An intersection of
A and B always results in 0 results.) All I can figure is that spark is
doing an equality check that determines nothing matches. What is that
equality function and is there some way I can change it?

Thanks
Tim
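
For reference, a self-contained sketch of the operations being discussed, on the
Spark 1.6 Dataset API. The case class and values are made up, not Tim's schema. The
intended semantics are that rows match only when every field compares equal, so a
differing column (for example trailing whitespace or a nullable field) makes
subtract/intersect treat rows as distinct; the thread reports cases where even this
does not behave as expected.

// Illustrative only; assumes an existing SQLContext named sqlContext.
import sqlContext.implicits._

case class Rec(id: Int, name: String)

val a = Seq(Rec(1, "x"), Rec(2, "y"), Rec(3, "z")).toDS()
val b = Seq(Rec(2, "y")).toDS()

a.subtract(b).show()   // intended result: ids 1 and 3
a.intersect(b).show()  // intended result: id 2
// Note: Rec(2, "y ") (with a trailing space) would NOT match Rec(2, "y").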


Re: How does Spark set task indexes?

2016-05-24 Thread Ted Yu
Have you taken a look at SPARK-14915 ?

On Tue, May 24, 2016 at 1:00 PM, Adrien Mogenet <
adrien.moge...@contentsquare.com> wrote:

> Hi,
>
> I'm wondering how Spark is setting the "index" of task?
> I'm asking this question because we have a job that constantly fails at
> task index = 421.
>
> When increasing number of partitions, this then fails at index=4421.
> Increase it a little bit more, now it's 24421.
>
> Our job is as simple as "(1) read json -> (2) group-by sesion identifier
> -> (3) write parquet files" and always fails somewhere at step (3) with a
> CommitDeniedException. We've identified that some troubles are basically
> due to uneven data repartition right after step (2), and now try to go
> further in our understanding on how does Spark behaves.
>
> We're using Spark 1.5.2, scala 2.11, on top of hadoop 2.6.0
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>


Error while saving plots

2016-05-24 Thread njoshi
For an analysis app, I have to make ROC curves on the fly and save to disc. I
am using scala-chart for this purpose and doing the following in my Spark
app:


val rocs = performances.map{case (id, (auRoc, roc)) => (id,
roc.collect().toList)}
XYLineChart(rocs.toSeq, title = "Pooled Data Performance:
AuROC").saveAsPNG(outputpath + "/plots/global.png")


However, I am getting the following exception. Does anyone have idea of the
cause?


Exception in thread "main" java.io.FileNotFoundException:
file:/home/njoshi/dev/outputs/test_/plots/global.png (No such file or
directory)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.(FileOutputStream.java:213)
at java.io.FileOutputStream.(FileOutputStream.java:101)
at
scalax.chart.exporting.PNGExporter$.saveAsPNG$extension(exporting.scala:138)
at
com.aol.advertising.ml.globaldata.EvaluatorDriver$.main(EvaluatorDriver.scala:313)
at
com.aol.advertising.ml.globaldata.EvaluatorDriver.main(EvaluatorDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks in advance,

Nikhil
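
One possible cause, offered as a guess rather than a confirmed fix: "outputpath"
appears to carry a "file:" URI prefix, and saveAsPNG writes through
java.io.FileOutputStream, which needs a plain local path whose parent directories
already exist. A hedged sketch of that idea:

// Strip a possible "file:" prefix and create the parent directories first.
// "outputpath" is the variable from the post above; the rest is illustrative.
import java.io.File

val localPath = outputpath.stripPrefix("file:")
val target = new File(localPath, "plots/global.png")
target.getParentFile.mkdirs()

// then, as in the original code:
// XYLineChart(rocs.toSeq, title = "Pooled Data Performance: AuROC")
//   .saveAsPNG(target.getAbsolutePath)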



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-saving-plots-tp27016.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Pradeep Nayak
BTW, I am using a 6-node cluster with m4.2xlarge machines on amazon. I have
tried with both yarn-cluster and spark's native cluster mode as well.

On Tue, May 24, 2016 at 12:10 PM Mathieu Longtin 
wrote:

> I have been seeing the same behavior in standalone with a master.
>
>
> On Tue, May 24, 2016 at 3:08 PM Pradeep Nayak 
> wrote:
>
>>
>>
>> I have posted the same question of Stack Overflow:
>> http://stackoverflow.com/questions/37421852/spark-submit-continues-to-hang-after-job-completion
>>
>> I am trying to test spark 1.6 with hdfs in AWS. I am using the wordcount
>> python example available in the examples folder. I submit the job with
>> spark-submit, the job completes successfully and its prints the results on
>> the console as well. The web-UI also says its completed. However the
>> spark-submit never terminates. I have verified that the context is stopped
>> in the word count example code as well.
>>
>> What could be wrong ?
>>
>> This is what I see on the console.
>>
>>
>> 6-05-24 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
>> (ContextHandler.java:doStop(843)) - stopped 
>> o.s.j.s.ServletContextHandler{/stages/stage,null}2016-05-24 14:58:04,749 
>> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
>> stopped o.s.j.s.ServletContextHandler{/stages/json,null}2016-05-24 
>> 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
>> (ContextHandler.java:doStop(843)) - stopped 
>> o.s.j.s.ServletContextHandler{/stages,null}2016-05-24 14:58:04,749 INFO  
>> [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
>> stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}2016-05-24 
>> 14:58:04,750 INFO  [Thread-3] handler.ContextHandler 
>> (ContextHandler.java:doStop(843)) - stopped 
>> o.s.j.s.ServletContextHandler{/jobs/job,null}2016-05-24 14:58:04,750 INFO  
>> [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
>> stopped o.s.j.s.ServletContextHandler{/jobs/json,null}2016-05-24 
>> 14:58:04,750 INFO  [Thread-3] handler.ContextHandler 
>> (ContextHandler.java:doStop(843)) - stopped 
>> o.s.j.s.ServletContextHandler{/jobs,null}2016-05-24 14:58:04,802 INFO  
>> [Thread-3] ui.SparkUI (Logging.scala:logInfo(58)) - Stopped Spark web UI at 
>> http://172.30.2.239:40402016-05-24 14:58:04,805 INFO  [Thread-3] 
>> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Shutting 
>> down all executors2016-05-24 14:58:04,805 INFO  [dispatcher-event-loop-2] 
>> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Asking 
>> each executor to shut down2016-05-24 14:58:04,814 INFO  
>> [dispatcher-event-loop-5] spark.MapOutputTrackerMasterEndpoint 
>> (Logging.scala:logInfo(58)) - MapOutputTrackerMasterEndpoint 
>> stopped!2016-05-24 14:58:04,818 INFO  [Thread-3] storage.MemoryStore 
>> (Logging.scala:logInfo(58)) - MemoryStore cleared2016-05-24 14:58:04,818 
>> INFO  [Thread-3] storage.BlockManager (Logging.scala:logInfo(58)) - 
>> BlockManager stopped2016-05-24 14:58:04,820 INFO  [Thread-3] 
>> storage.BlockManagerMaster (Logging.scala:logInfo(58)) - BlockManagerMaster 
>> stopped2016-05-24 14:58:04,821 INFO  [dispatcher-event-loop-3] 
>> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint 
>> (Logging.scala:logInfo(58)) - OutputCommitCoordinator stopped!2016-05-24 
>> 14:58:04,824 INFO  [Thread-3] spark.SparkContext (Logging.scala:logInfo(58)) 
>> - Successfully stopped SparkContext2016-05-24 14:58:04,827 INFO  
>> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
>> remote.RemoteActorRefProvider$RemotingTerminator 
>> (Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote 
>> daemon.2016-05-24 14:58:04,828 INFO  
>> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
>> remote.RemoteActorRefProvider$RemotingTerminator 
>> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding 
>> with flushing remote transports.2016-05-24 14:58:04,843 INFO  
>> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
>> remote.RemoteActorRefProvider$RemotingTerminator 
>> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting shut down.
>>
>>
>> I have to do a ctrl-c to terminate the spark-submit process. This is
>> really a weird problem and I have no idea how to fix this. Please let me
>> know if there are any logs I should be looking at, or doing things
>> differently here.
>>
>>
>> --
> Mathieu Longtin
> 1-514-803-8977
>


How does Spark set task indexes?

2016-05-24 Thread Adrien Mogenet
Hi,

I'm wondering how Spark is setting the "index" of task?
I'm asking this question because we have a job that constantly fails at
task index = 421.

When increasing number of partitions, this then fails at index=4421.
Increase it a little bit more, now it's 24421.

Our job is as simple as "(1) read json -> (2) group-by session identifier ->
(3) write parquet files" and always fails somewhere at step (3) with a
CommitDeniedException. We've identified that some troubles are basically
due to uneven data repartitioning right after step (2), and are now trying to go
further in our understanding of how Spark behaves.

We're using Spark 1.5.2, scala 2.11, on top of hadoop 2.6.0

-- 

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris
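
Not an answer from the thread, but a small diagnostic sketch that may help with this
kind of investigation: within a stage the task index corresponds to the partition
index, so counting records per partition right after the group-by shows whether one
partition (for example index 421) is pathologically large. "grouped" stands for the
RDD produced by step (2); the name is an assumption.

// Count records per partition to spot skew after the group-by step.
val partitionSizes = grouped
  .mapPartitionsWithIndex { (idx, it) => Iterator((idx, it.size)) }
  .collect()
  .sortBy(-_._2)

partitionSizes.take(10).foreach { case (idx, n) =>
  println(s"partition $idx -> $n records")
}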


Error publishing to spark-packages

2016-05-24 Thread Neville Li
Hi guys,


I built a Spark package but couldn't publish it with the sbt-spark-package
plugin. Any idea why these are failing?
http://spark-packages.org/staging?id=1179
http://spark-packages.org/staging?id=1168

Repo: https://github.com/spotify/spark-bigquery
Jars are published to Maven: https://repo1.maven.org/maven2/com/spotify/
spark-bigquery_2.10/

Emailed feedb...@spark-packages.org a few days ago but haven't gotten any
response so far.

Cheers,
Neville


Re: Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread Cody Koeninger
Have you looked at everything linked from

https://github.com/koeninger/kafka-exactly-once


On Tue, May 24, 2016 at 2:07 PM, sagarcasual .  wrote:
> In spark streaming consuming kafka using KafkaUtils.createDirectStream,
> there are examples of the kafka offset level ranges. However if
> 1. I would like periodically maintain offset level so that if needed I can
> reprocess items from a offset. Is there any way I can retrieve offset of a
> message in rdd while I am processing each message?
> 2. Also with offsetranges, I have start and end offset for the RDD, but what
> if while processing each record of the RDD system encounters and error and
> job ends. Now if I want to begin processing from the record that failed, how
> do I first save the last successful offset so that I can start with that
> when starting next time.
>
> Appreciate your help.
>




Re: Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Mathieu Longtin
I have been seeing the same behavior in standalone with a master.

On Tue, May 24, 2016 at 3:08 PM Pradeep Nayak  wrote:

>
>
> I have posted the same question of Stack Overflow:
> http://stackoverflow.com/questions/37421852/spark-submit-continues-to-hang-after-job-completion
>
> I am trying to test spark 1.6 with hdfs in AWS. I am using the wordcount
> python example available in the examples folder. I submit the job with
> spark-submit, the job completes successfully and its prints the results on
> the console as well. The web-UI also says its completed. However the
> spark-submit never terminates. I have verified that the context is stopped
> in the word count example code as well.
>
> What could be wrong ?
>
> This is what I see on the console.
>
>
> 6-05-24 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
> (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/stages/stage,null}2016-05-24 14:58:04,749 INFO 
>  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/stages/json,null}2016-05-24 
> 14:58:04,749 INFO  [Thread-3] handler.ContextHandler 
> (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/stages,null}2016-05-24 14:58:04,749 INFO  
> [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - stopped 
> o.s.j.s.ServletContextHandler{/jobs/job/json,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs/job,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs/json,null}2016-05-24 14:58:04,750 
> INFO  [Thread-3] handler.ContextHandler (ContextHandler.java:doStop(843)) - 
> stopped o.s.j.s.ServletContextHandler{/jobs,null}2016-05-24 14:58:04,802 INFO 
>  [Thread-3] ui.SparkUI (Logging.scala:logInfo(58)) - Stopped Spark web UI at 
> http://172.30.2.239:40402016-05-24 14:58:04,805 INFO  [Thread-3] 
> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Shutting 
> down all executors2016-05-24 14:58:04,805 INFO  [dispatcher-event-loop-2] 
> cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) - Asking each 
> executor to shut down2016-05-24 14:58:04,814 INFO  [dispatcher-event-loop-5] 
> spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(58)) - 
> MapOutputTrackerMasterEndpoint stopped!2016-05-24 14:58:04,818 INFO  
> [Thread-3] storage.MemoryStore (Logging.scala:logInfo(58)) - MemoryStore 
> cleared2016-05-24 14:58:04,818 INFO  [Thread-3] storage.BlockManager 
> (Logging.scala:logInfo(58)) - BlockManager stopped2016-05-24 14:58:04,820 
> INFO  [Thread-3] storage.BlockManagerMaster (Logging.scala:logInfo(58)) - 
> BlockManagerMaster stopped2016-05-24 14:58:04,821 INFO  
> [dispatcher-event-loop-3] 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint 
> (Logging.scala:logInfo(58)) - OutputCommitCoordinator stopped!2016-05-24 
> 14:58:04,824 INFO  [Thread-3] spark.SparkContext (Logging.scala:logInfo(58)) 
> - Successfully stopped SparkContext2016-05-24 14:58:04,827 INFO  
> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote daemon.2016-05-24 
> 14:58:04,828 INFO  [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down; proceeding 
> with flushing remote transports.2016-05-24 14:58:04,843 INFO  
> [sparkDriverActorSystem-akka.actor.default-dispatcher-2] 
> remote.RemoteActorRefProvider$RemotingTerminator 
> (Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting shut down.
>
>
> I have to do a ctrl-c to terminate the spark-submit process. This is
> really a weird problem and I have no idea how to fix this. Please let me
> know if there are any logs I should be looking at, or doing things
> differently here.
>
>
> --
Mathieu Longtin
1-514-803-8977


Spark-submit hangs indefinitely after job completion.

2016-05-24 Thread Pradeep Nayak
I have posted the same question of Stack Overflow:
http://stackoverflow.com/questions/37421852/spark-submit-continues-to-hang-after-job-completion

I am trying to test Spark 1.6 with HDFS in AWS. I am using the wordcount
Python example available in the examples folder. I submit the job with
spark-submit; the job completes successfully and it prints the results on
the console as well. The web UI also says it's completed. However, the
spark-submit process never terminates. I have verified that the context is stopped
in the word count example code as well.

What could be wrong ?

This is what I see on the console.


6-05-24 14:58:04,749 INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/stages/stage,null}2016-05-24
14:58:04,749 INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/stages/json,null}2016-05-24
14:58:04,749 INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/stages,null}2016-05-24 14:58:04,749
INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/jobs/job/json,null}2016-05-24
14:58:04,750 INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/jobs/job,null}2016-05-24 14:58:04,750
INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/jobs/json,null}2016-05-24 14:58:04,750
INFO  [Thread-3] handler.ContextHandler
(ContextHandler.java:doStop(843)) - stopped
o.s.j.s.ServletContextHandler{/jobs,null}2016-05-24 14:58:04,802 INFO
[Thread-3] ui.SparkUI (Logging.scala:logInfo(58)) - Stopped Spark web
UI at http://172.30.2.239:40402016-05-24 14:58:04,805 INFO  [Thread-3]
cluster.SparkDeploySchedulerBackend (Logging.scala:logInfo(58)) -
Shutting down all executors2016-05-24 14:58:04,805 INFO
[dispatcher-event-loop-2] cluster.SparkDeploySchedulerBackend
(Logging.scala:logInfo(58)) - Asking each executor to shut
down2016-05-24 14:58:04,814 INFO  [dispatcher-event-loop-5]
spark.MapOutputTrackerMasterEndpoint (Logging.scala:logInfo(58)) -
MapOutputTrackerMasterEndpoint stopped!2016-05-24 14:58:04,818 INFO
[Thread-3] storage.MemoryStore (Logging.scala:logInfo(58)) -
MemoryStore cleared2016-05-24 14:58:04,818 INFO  [Thread-3]
storage.BlockManager (Logging.scala:logInfo(58)) - BlockManager
stopped2016-05-24 14:58:04,820 INFO  [Thread-3]
storage.BlockManagerMaster (Logging.scala:logInfo(58)) -
BlockManagerMaster stopped2016-05-24 14:58:04,821 INFO
[dispatcher-event-loop-3]
scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint
(Logging.scala:logInfo(58)) - OutputCommitCoordinator
stopped!2016-05-24 14:58:04,824 INFO  [Thread-3] spark.SparkContext
(Logging.scala:logInfo(58)) - Successfully stopped
SparkContext2016-05-24 14:58:04,827 INFO
[sparkDriverActorSystem-akka.actor.default-dispatcher-2]
remote.RemoteActorRefProvider$RemotingTerminator
(Slf4jLogger.scala:apply$mcV$sp(74)) - Shutting down remote
daemon.2016-05-24 14:58:04,828 INFO
[sparkDriverActorSystem-akka.actor.default-dispatcher-2]
remote.RemoteActorRefProvider$RemotingTerminator
(Slf4jLogger.scala:apply$mcV$sp(74)) - Remote daemon shut down;
proceeding with flushing remote transports.2016-05-24 14:58:04,843
INFO  [sparkDriverActorSystem-akka.actor.default-dispatcher-2]
remote.RemoteActorRefProvider$RemotingTerminator
(Slf4jLogger.scala:apply$mcV$sp(74)) - Remoting shut down.


I have to do a ctrl-c to terminate the spark-submit process. This is really
a weird problem and I have no idea how to fix this. Please let me know if
there are any logs I should be looking at, or doing things differently here.


Maintain kafka offset externally as Spark streaming processes records.

2016-05-24 Thread sagarcasual .
In Spark Streaming, consuming Kafka using KafkaUtils.createDirectStream,
there are examples of the Kafka offset ranges. However:
1. I would like to periodically record the offsets so that, if needed, I can
reprocess items from an offset. Is there any way I can retrieve the offset of a
message in the RDD while I am processing each message?
2. Also, with offset ranges I have the start and end offset for the RDD, but
what if, while processing each record of the RDD, the system encounters an error
and the job ends? If I then want to begin processing from the record that
failed, how do I first save the last successful offset so that I can start
with it the next time?

Appreciate your help.
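
For question 1, the direct stream does expose the offset ranges of each batch on the
driver; a common pattern (sketched below with placeholder broker, topic and storage
names, and assuming an existing StreamingContext "ssc") is to read them via
HasOffsetRanges and persist them once the batch's work has succeeded.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

// Placeholder: persist offsets to a DB/ZooKeeper/file of your choice.
def saveOffsets(topic: String, partition: Int, untilOffset: Long): Unit = ()

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

stream.foreachRDD { rdd =>
  // Grab the offset ranges before any transformation changes the RDD type.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { messages =>
    messages.foreach { case (_, value) => () } // process each message value here
  }

  // After the batch succeeds, record the end offsets so a restart can resume from them.
  ranges.foreach(r => saveOffsets(r.topic, r.partition, r.untilOffset))
}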


Re: How to read *.jhist file in Spark using scala

2016-05-24 Thread Miles
Instead of reading *.jhist files direclty in Spark, you could convert your
.jhist files into Json and then read Json files in Spark.

Here's a post on converting .jhist file to json format.
http://stackoverflow.com/questions/32683907/converting-jhist-files-to-json-format



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-jhist-file-in-Spark-using-scala-tp26972p27015.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-24 Thread chandan prakash
Resolved.
I passed the parameters in sparkConf instead of passing them to the spark-submit
command (I still don't know why passing them to spark-submit did not work):

 sparkConf.set("spark.executor.extraJavaOptions", "-XX:+UseG1GC
-Dlog4j.configuration=file:log4j_RequestLogExecutor.properties ")




On Tue, May 24, 2016 at 10:24 PM, chandan prakash  wrote:

> Any suggestion?
>
> On Mon, May 23, 2016 at 5:18 PM, chandan prakash <
> chandanbaran...@gmail.com> wrote:
>
>> Hi,
>> I am able to do logging for driver but not for executor.
>>
>> I am running spark streaming under mesos.
>> Want to do log4j logging separately for driver and executor.
>>
>> Used the below option in spark-submit command :
>>
>> --driver-java-options 
>> "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
>>  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:
>> /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties
>> "
>>
>> Logging for driver at path mentioned as in
>> log4j_RequestLogDriver.properties(/tmp/requestLogDriver.log) is
>> happening fine.
>> But for executor, there is no logging happening (shud be at
>> /tmp/requestLogExecutor.log as mentioned in 
>> log4j_RequestLogExecutor.properties
>> on executor machines)
>>
>> *Any suggestions how to get logging enabled for executor ?*
>>
>> TIA,
>> Chandan
>>
>> --
>> Chandan Prakash
>>
>>
>
>
> --
> Chandan Prakash
>
>


-- 
Chandan Prakash


Re: Spark Streaming with Kafka

2016-05-24 Thread Rasika Pohankar
Hi firemonk9,

Sorry, its been too long but I just saw this. I hope you were able to
resolve it. FWIW, we were able to solve this with the help of the Low Level
Kafka Consumer, instead of the inbuilt Kafka consumer in Spark, from here:
https://github.com/dibbhatt/kafka-spark-consumer/.

Regards,
Rasika.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-with-Kafka-tp21222p27014.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark streaming: issue with logging with separate log4j properties files for driver and executor

2016-05-24 Thread chandan prakash
Any suggestion?

On Mon, May 23, 2016 at 5:18 PM, chandan prakash 
wrote:

> Hi,
> I am able to do logging for driver but not for executor.
>
> I am running spark streaming under mesos.
> Want to do log4j logging separately for driver and executor.
>
> Used the below option in spark-submit command :
>
> --driver-java-options 
> "-Dlog4j.configuration=file:/usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogDriver.properties"
>  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:
> /usr/local/spark-1.5.1-bin-hadoop2.6/conf/log4j_RequestLogExecutor.properties
> "
>
> Logging for driver at path mentioned as in
> log4j_RequestLogDriver.properties(/tmp/requestLogDriver.log) is happening
> fine.
> But for executor, there is no logging happening (shud be at
> /tmp/requestLogExecutor.log as mentioned in 
> log4j_RequestLogExecutor.properties
> on executor machines)
>
> *Any suggestions how to get logging enabled for executor ?*
>
> TIA,
> Chandan
>
> --
> Chandan Prakash
>
>


-- 
Chandan Prakash


Possible bug involving Vectors with a single element

2016-05-24 Thread flyinggip
Hi there, 

I notice that there might be a bug in pyspark.mllib.linalg.Vectors when
dealing with a vector with a single element. 

Firstly, the 'dense' method says it can also take a numpy.array. However, the
code uses 'if len(elements) == 1', and when a numpy.array has only one
element its length is undefined, so currently calling dense() on a numpy
array with one element crashes the program. Probably 'size' should be used
instead of len() in the above check.

Secondly, after I managed to create a dense-Vectors object with only one
element from unicode, it seems that its behaviour is unpredictable. For
example, 

Vectors.dense(unicode("0.1"))

will report an error. 

dense_vec = Vectors.dense(unicode("0.1"))

will NOT report any error until you run 

dense_vec

to check its value. And the following will be able to create a successful
DataFrame: 

mylist = [(0, Vectors.dense(unicode("0.1")))]
myrdd = sc.parallelize(mylist)
mydf = sqlContext.createDataFrame(myrdd, ["X", "Y"])

However if the above unicode value is read from a text file (e.g., a csv
file with 2 columns) then the DataFrame column corresponding to "Y" will be
EMPTY: 

raw_data = sc.textFile(filename)
split_data = raw_data.map(lambda line: line.split(','))
parsed_data = split_data.map(lambda line: (int(line[0]),
Vectors.dense(line[1])))
mydf = sqlContext.createDataFrame(parsed_data, ["X", "Y"])

It would be great if someone could share some ideas. Thanks a lot. 

f. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Possible-bug-involving-Vectors-with-a-single-element-tp27013.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Mathieu Longtin
In standalone mode, executors assume they have access to a shared file
system. The driver creates the directory and the executors write the files, so
the executors end up not writing anything since the directory does not exist locally on them.

On Tue, May 24, 2016 at 8:01 AM Stuti Awasthi  wrote:

> hi Jacek,
>
> Parent directory already present, its my home directory. Im using Linux
> (Redhat) machine 64 bit.
> Also I noticed that "test1" folder is created in my master with
> subdirectory as "_temporary" which is empty. but on slaves, no such
> directory is created under /home/stuti.
>
> Thanks
> Stuti
> --
> *From:* Jacek Laskowski [ja...@japila.pl]
> *Sent:* Tuesday, May 24, 2016 5:27 PM
> *To:* Stuti Awasthi
> *Cc:* user
> *Subject:* Re: Not able to write output to local filsystem from
> Standalone mode.
>
> Hi,
>
> What happens when you create the parent directory /home/stuti? I think the
> failure is due to missing parent directories. What's the OS?
>
> Jacek
> On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:
>
> Hi All,
>
> I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
> Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
> shell , read the input file from local filesystem and perform
> transformation successfully. When I try to write my output in local
> filesystem path then I receive below error .
>
>
>
> I tried to search on web and found similar Jira :
> https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
> resolved for Spark 1.3+ but already people have posted the same issue still
> persists in latest versions.
>
>
>
> *ERROR*
>
> scala> data.saveAsTextFile("/home/stuti/test1")
>
> 16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
> server1): java.io.IOException: The temporary job-output directory
> file:/home/stuti/test1/_temporary doesn't exist!
>
> at
> org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
>
> at
> org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
>
> at
> org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
>
> at
> org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
>
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
>
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> What is the best way to resolve this issue if suppose I don’t want to have
> Hadoop installed OR is it mandatory to have Hadoop to write the output from
> Standalone cluster mode.
>
>
>
> Please suggest.
>
>
>
> Thanks 
>
> Stuti Awasthi
>
>
>
>
>

-- 
Mathieu Longtin
1-514-803-8977


Re: Hive_context

2016-05-24 Thread Ajay Chander
Hi Arun,

Thanks for your time. I was able to connect through a JDBC Java client, but I
am not able to connect from my Spark application. Do you think I missed any
configuration step within the code? Somehow my application is not picking
up hive-site.xml from my machine; I put it under the classpath
${SPARK_HOME}/conf/ . It would be really helpful if anyone has any
sort of example in either Java or Scala. Thank you.

On Monday, May 23, 2016, Arun Natva  wrote:

> Can you try a hive JDBC java client from eclipse and query a hive table
> successfully ?
>
> This way we can narrow down where the issue is ?
>
>
> Sent from my iPhone
>
> On May 23, 2016, at 5:26 PM, Ajay Chander  > wrote:
>
> I downloaded the spark 1.5 untilities and exported SPARK_HOME pointing to
> it. I copied all the cluster configuration files(hive-site.xml,
> hdfs-site.xml etc files) inside the ${SPARK_HOME}/conf/ . My application
> looks like below,
>
>
> public class SparkSqlTest {
>
> public static void main(String[] args) {
>
>
> SparkConf sc = new SparkConf().setAppName("SQL_Test").setMaster("local");
>
> JavaSparkContext jsc = new JavaSparkContext(sc);
>
> HiveContext hiveContext = new org.apache.spark.sql.hive.HiveContext(jsc
> .sc());
>
> DataFrame sampleDataFrame = hiveContext.sql("show tables");
>
> sampleDataFrame.show();
>
>
> }
>
> }
>
>
> I am expecting my application to return all the tables from the default
> database. But somehow it returns empty list. I am just wondering if I need
> to add anything to my code to point it to hive metastore. Thanks for your
> time. Any pointers are appreciated.
>
>
> Regards,
>
> Aj
>
>
> On Monday, May 23, 2016, Ajay Chander  > wrote:
>
>> Hi Everyone,
>>
>> I am building a Java Spark application in eclipse IDE. From my
>> application I want to use hiveContext to read tables from the remote
>> Hive(Hadoop cluster). On my machine I have exported $HADOOP_CONF_DIR =
>> {$HOME}/hadoop/conf/. This path has all the remote cluster conf details
>> like hive-site.xml, hdfs-site.xml ... Somehow I am not able to communicate
>> to remote cluster from my app. Is there any additional configuration work
>> that I am supposed to do to get it work? I specified master as 'local' in
>> the code. Thank you.
>>
>> Regards,
>> Aj
>>
>
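
One hedged suggestion for the empty "show tables" result described above (not
confirmed in the thread): if hive-site.xml is not being picked up from the driver
classpath, the metastore location can also be set programmatically. The sketch below
is Scala rather than the Java of the original snippet, and the metastore host and
port are placeholders.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("SQL_Test").setMaster("local"))
val hiveContext = new HiveContext(sc)

// Point the embedded Hive client at the remote metastore explicitly,
// in case hive-site.xml is not on the classpath.
hiveContext.setConf("hive.metastore.uris", "thrift://metastore-host:9083")

hiveContext.sql("show tables").show()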


Re: Using HiveContext.set in multipul threads

2016-05-24 Thread Silvio Fiorito
If you’re using DataFrame API you can achieve that by simply using (or not) the 
“partitionBy” method on the DataFrameWriter:

val originalDf = ….

val df1 = originalDf….
val df2 = originalDf…

df1.write.partitionBy("col1").save(…)

df2.write.save(…)

From: Amir Gershman 
Date: Tuesday, May 24, 2016 at 7:01 AM
To: "user@spark.apache.org" 
Subject: Using HiveContext.set in multipul threads

Hi,

I have a DataFrame I compute from a long chain of transformations.
I cache it, and then perform two additional transformations on it.
I use two Futures - each Future will insert the content of one of the above 
Dataframe to a different hive table.
One Future must SET hive.exec.dynamic.partition=true and the other must set it 
to false.



How can I run both INSERT commands in parallel, but guarantee each runs with 
its own settings?



If I don't use the same HiveContext then the initial long chain of 
transformations which I cache is not reusable between HiveContexts. If I use 
the same HiveContext, race conditions between threads may cause one INSERT to 
execute with the wrong config.
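
A slightly fuller sketch of the approach Silvio describes above, with the two writes
wrapped in Futures so they run concurrently. The table name, column names and output
paths are made up.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each DataFrameWriter decides for itself whether to partition, so no shared
// HiveContext setting has to be flipped between threads.
val originalDf = hiveContext.table("source_table").cache()

val df1 = originalDf.filter("kind = 'a'")
val df2 = originalDf.filter("kind = 'b'")

val w1 = Future { df1.write.partitionBy("col1").parquet("/tmp/out/partitioned") }
val w2 = Future { df2.write.parquet("/tmp/out/flat") }

Await.result(Future.sequence(Seq(w1, w2)), Duration.Inf)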



RE: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi



hi Jacek,


The parent directory is already present; it's my home directory. I'm using a 64-bit Linux (Red Hat) machine.
Also, I noticed that the "test1" folder is created on my master with a subdirectory "_temporary" which is empty, but on the slaves no such directory is created under /home/stuti.


Thanks
Stuti 


From: Jacek Laskowski [ja...@japila.pl]
Sent: Tuesday, May 24, 2016 5:27 PM
To: Stuti Awasthi
Cc: user
Subject: Re: Not able to write output to local filsystem from Standalone mode.




Hi, 
What happens when you create the parent directory /home/stuti? I think the failure is due to missing parent directories. What's the OS?

Jacek
On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:



Hi All,
I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2 Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch shell , read the input file from local filesystem and perform transformation successfully. When
 I try to write my output in local filesystem path then I receive below error .
 
I tried to search on web and found similar Jira : 
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows resolved for Spark 1.3+ but already people have posted the same issue still persists in latest versions.
 
ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, server1): java.io.IOException: The temporary job-output directory file:/home/stuti/test1/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
 
What is the best way to resolve this issue if suppose I don’t want to have Hadoop installed OR is it mandatory to have Hadoop to write the output from Standalone cluster mode.
 
Please suggest.
 
Thanks 
Stuti Awasthi
 







Re: Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Jacek Laskowski
Hi,

What happens when you create the parent directory /home/stuti? I think the
failure is due to missing parent directories. What's the OS?

Jacek
On 24 May 2016 11:27 a.m., "Stuti Awasthi"  wrote:

Hi All,

I have 3 nodes Spark 1.6 Standalone mode cluster with 1 Master and 2
Slaves. Also Im not having Hadoop as filesystem . Now, Im able to launch
shell , read the input file from local filesystem and perform
transformation successfully. When I try to write my output in local
filesystem path then I receive below error .



I tried to search on web and found similar Jira :
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows
resolved for Spark 1.3+ but already people have posted the same issue still
persists in latest versions.



*ERROR*

scala> data.saveAsTextFile("/home/stuti/test1")

16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2,
server1): java.io.IOException: The temporary job-output directory
file:/home/stuti/test1/_temporary doesn't exist!

at
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)

at
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)

at
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)

at
org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)

at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)

at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:89)

at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:745)



What is the best way to resolve this issue if suppose I don’t want to have
Hadoop installed OR is it mandatory to have Hadoop to write the output from
Standalone cluster mode.



Please suggest.



Thanks 

Stuti Awasthi









Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Mich Talebzadeh
fine give me an example where you have tried to turn on async for the query
using spark sql. Your actual code.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 May 2016 at 12:25, Raju Bairishetti  wrote:

> Hi Mich,
>
>   Thanks for the response.
>
> yes, I do not want to block until the hive query is completed and want to
> know is there any way to poll the status/progress of submitted query.
>
> I can turn on asyc mode for hive queries in spark sql but  how to track
> the status of the submitted query?
>
> On Tue, May 24, 2016 at 6:48 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> By Hive queries in async mode, you mean submitting sql queries to Hive
>> and move on to the next operation and wait for return of result set from
>> Hive?
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 24 May 2016 at 11:26, Raju Bairishetti  wrote:
>>
>>> Any thoughts on this?
>>>
>>> In hive, it returns operation handle. This handle can be used for
>>> fetching the status of query.  Is there any similar mechanism in spark sql?
>>> Looks like almost all the methods in the HiveContext are either protected
>>> or private.
>>>
>>> On Wed, May 18, 2016 at 9:03 AM, Raju Bairishetti 
>>> wrote:
>>>
 I am using spark sql for running hive queries also. Is there any way to
 run hive queries in asyc mode using spark sql.

 Does it return any hive handle or if yes how to get the results from
 hive handle using spark sql?

 --
 Thanks,
 Raju Bairishetti,

 www.lazada.com



>>>
>>>
>>> --
>>> Thanks,
>>> Raju Bairishetti,
>>>
>>> www.lazada.com
>>>
>>>
>>>
>>
>
>
> --
> Thanks,
> Raju Bairishetti,
>
> www.lazada.com
>
>
>


Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Raju Bairishetti
Hi Mich,

  Thanks for the response.

Yes, I do not want to block until the Hive query is completed, and I want to
know whether there is any way to poll the status/progress of the submitted query.

I can turn on async mode for Hive queries in Spark SQL, but how do I track the
status of the submitted query?

On Tue, May 24, 2016 at 6:48 PM, Mich Talebzadeh 
wrote:

> By Hive queries in async mode, you mean submitting sql queries to Hive and
> move on to the next operation and wait for return of result set from Hive?
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 24 May 2016 at 11:26, Raju Bairishetti  wrote:
>
>> Any thoughts on this?
>>
>> In hive, it returns operation handle. This handle can be used for
>> fetching the status of query.  Is there any similar mechanism in spark sql?
>> Looks like almost all the methods in the HiveContext are either protected
>> or private.
>>
>> On Wed, May 18, 2016 at 9:03 AM, Raju Bairishetti 
>> wrote:
>>
>>> I am using spark sql for running hive queries also. Is there any way to
>>> run hive queries in asyc mode using spark sql.
>>>
>>> Does it return any hive handle or if yes how to get the results from
>>> hive handle using spark sql?
>>>
>>> --
>>> Thanks,
>>> Raju Bairishetti,
>>>
>>> www.lazada.com
>>>
>>>
>>>
>>
>>
>> --
>> Thanks,
>> Raju Bairishetti,
>>
>> www.lazada.com
>>
>>
>>
>


-- 
Thanks,
Raju Bairishetti,

www.lazada.com


Using HiveContext.set in multipul threads

2016-05-24 Thread Amir Gershman
Hi,

I have a DataFrame I compute from a long chain of transformations.
I cache it, and then perform two additional transformations on it.
I use two Futures - each Future will insert the content of one of the above 
Dataframe to a different hive table.
One Future must SET hive.exec.dynamic.partition=true and the other must set it 
to false.


How can I run both INSERT commands in parallel, but guarantee each runs with 
its own settings?


If I don't use the same HiveContext then the initial long chain of 
transformations which I cache is not reusable between HiveContexts. If I use 
the same HiveContext, race conditions between threads may cause one INSERT to 
execute with the wrong config.



Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Mich Talebzadeh
By Hive queries in async mode, you mean submitting sql queries to Hive and
move on to the next operation and wait for return of result set from Hive?

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 24 May 2016 at 11:26, Raju Bairishetti  wrote:

> Any thoughts on this?
>
> In hive, it returns operation handle. This handle can be used for fetching
> the status of query.  Is there any similar mechanism in spark sql? Looks
> like almost all the methods in the HiveContext are either protected or
> private.
>
> On Wed, May 18, 2016 at 9:03 AM, Raju Bairishetti 
> wrote:
>
>> I am using spark sql for running hive queries also. Is there any way to
>> run hive queries in asyc mode using spark sql.
>>
>> Does it return any hive handle or if yes how to get the results from hive
>> handle using spark sql?
>>
>> --
>> Thanks,
>> Raju Bairishetti,
>>
>> www.lazada.com
>>
>>
>>
>
>
> --
> Thanks,
> Raju Bairishetti,
>
> www.lazada.com
>
>
>


Re: How to run hive queries in async mode using spark sql

2016-05-24 Thread Raju Bairishetti
Any thoughts on this?

In hive, it returns operation handle. This handle can be used for fetching
the status of query.  Is there any similar mechanism in spark sql? Looks
like almost all the methods in the HiveContext are either protected or
private.

On Wed, May 18, 2016 at 9:03 AM, Raju Bairishetti 
wrote:

> I am using spark sql for running hive queries also. Is there any way to
> run hive queries in asyc mode using spark sql.
>
> Does it return any hive handle or if yes how to get the results from hive
> handle using spark sql?
>
> --
> Thanks,
> Raju Bairishetti,
>
> www.lazada.com
>
>
>


-- 
Thanks,
Raju Bairishetti,

www.lazada.com
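
Spark SQL does not hand back an operation handle the way HiveServer2 does, but one
client-side workaround sketch (not an official progress API) is to run the statement
in a Future under a named job group and poll that group through SparkContext's status
tracker. The query text and group name below are placeholders.

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val groupId = "my-async-hive-query"

val result = Future {
  // setJobGroup is thread-local, so call it on the thread that runs the query.
  sc.setJobGroup(groupId, "async hive query", interruptOnCancel = true)
  hiveContext.sql("SELECT count(*) FROM some_table").collect()
}

// Poll from the caller: which Spark jobs belong to the group, and what is their status?
val jobIds = sc.statusTracker.getJobIdsForGroup(groupId)
jobIds.flatMap(id => sc.statusTracker.getJobInfo(id)).foreach { info =>
  println(s"job ${info.jobId()}: ${info.status()}")
}
println(s"finished: ${result.isCompleted}")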


Not able to write output to local filsystem from Standalone mode.

2016-05-24 Thread Stuti Awasthi
Hi All,
I have a 3-node Spark 1.6 Standalone mode cluster with 1 Master and 2 Slaves.
Also, I am not using Hadoop as the filesystem. Now, I am able to launch the shell,
read the input file from the local filesystem and perform transformations
successfully. When I try to write my output to a local filesystem path, I receive
the error below.

I tried to search on the web and found a similar Jira:
https://issues.apache.org/jira/browse/SPARK-2984 . Even though it shows as
resolved for Spark 1.3+, people have posted that the same issue still
persists in the latest versions.

ERROR
scala> data.saveAsTextFile("/home/stuti/test1")
16/05/24 05:03:42 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
server1): java.io.IOException: The temporary job-output directory 
file:/home/stuti/test1/_temporary doesn't exist!
at 
org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
at 
org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:116)
at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:91)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1193)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1185)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

What is the best way to resolve this issue if I do not want to have
Hadoop installed, or is it mandatory to have Hadoop in order to write the output from
Standalone cluster mode?

Please suggest.

Thanks 
Stuti Awasthi







Re: Spark JOIN Not working

2016-05-24 Thread Alonso Isidoro Roman
Could you share a bit of the dataset? It is difficult to test without data.

Alonso Isidoro Roman
https://about.me/alonso.isidoro.roman


2016-05-24 8:43 GMT+02:00 Aakash Basu :

> Hi experts,
>
> I'm extremely new to the Spark ecosystem, so I need some help from you.
> I'm fetching data from CSV files and querying them with joins. When I
> cache the data using registerTempTable and then use a SELECT query to
> select what I need for the given condition, I get the data. But when I
> try the same query using a JOIN, I get zero results.
>
> Both attached files are mostly the same, with a few differences:
> Running_Code.scala uses the SELECT query and
> ProductHierarchy_Dimension.scala uses the JOIN queries.
>
> Please help me out with this; I have been stuck for two long days.
>
> Thanks,
> Aakash.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: Spark Streaming with Redis

2016-05-24 Thread Pariksheet Barapatre
Thanks Sachin. The link that you mentioned uses the native connection
library JedisPool.

I am looking into whether I can use https://github.com/RedisLabs/spark-redis
for the same functionality.

Regards
Pari

On 24 May 2016 at 13:33, Sachin Aggarwal  wrote:

> Hi,
>
> The Yahoo streaming benchmark uses Redis with Spark; have a look at this:
>
>
> https://github.com/yahoo/streaming-benchmarks/blob/master/spark-benchmarks/src/main/scala/AdvertisingSpark.scala
>
> On Tue, May 24, 2016 at 1:28 PM, Pariksheet Barapatre <
> pbarapa...@gmail.com> wrote:
>
>> Hello All,
>>
>> I am trying to use Redis as a data store for one of my sensor data use
>> cases, and I am fairly new to Redis.
>>
>> I guess the spark-redis module can help me, but I do not understand how to
>> use the INCR or HINCRBY Redis functions.
>>
>> Could you please point me to some example code or any other pointers for
>> solving this problem.
>>
>> Thanks in advance.
>>
>> Cheers,
>> Pari
>>
>
>
>
> --
>
> Thanks & Regards
>
> Sachin Aggarwal
> 7760502772
>



-- 
Cheers,
Pari


Re: Spark Streaming with Redis

2016-05-24 Thread Sachin Aggarwal
Hi,

The Yahoo streaming benchmark uses Redis with Spark; have a look at this:

https://github.com/yahoo/streaming-benchmarks/blob/master/spark-benchmarks/src/main/scala/AdvertisingSpark.scala

On Tue, May 24, 2016 at 1:28 PM, Pariksheet Barapatre 
wrote:

> Hello All,
>
> I am trying to use Redis as a data store for one of my sensor data use
> cases, and I am fairly new to Redis.
>
> I guess the spark-redis module can help me, but I do not understand how to
> use the INCR or HINCRBY Redis functions.
>
> Could you please point me to some example code or any other pointers for
> solving this problem.
>
> Thanks in advance.
>
> Cheers,
> Pari
>



-- 

Thanks & Regards

Sachin Aggarwal
7760502772


Spark Streaming with Redis

2016-05-24 Thread Pariksheet Barapatre
Hello All,

I am trying to use Redis as a data store for one of my sensor data use cases,
and I am fairly new to Redis.

I guess the spark-redis module can help me, but I do not understand how to use
the INCR or HINCRBY Redis functions.

Could you please point me to some example code or any other pointers for
solving this problem.
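
For what it's worth, a minimal sketch of issuing INCR / HINCRBY from Spark
Streaming using the plain Jedis client inside foreachPartition; the stream,
host and key names are assumptions, and as far as I can tell spark-redis is
geared more towards bulk reads and writes of RDDs than per-record counter
updates:

import redis.clients.jedis.Jedis

// sensorStream is assumed to be a DStream[(String, Long)] of (sensorId, reading) pairs.
sensorStream.foreachRDD { rdd =>
  rdd.foreachPartition { events =>
    val jedis = new Jedis("redis-host", 6379)   // one connection per partition
    try {
      events.foreach { case (sensorId, reading) =>
        jedis.incr("events:" + sensorId)             // running event count per sensor
        jedis.hincrBy("readings", sensorId, reading) // running sum per sensor in a hash
      }
    } finally {
      jedis.close()
    }
  }
}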

Thanks in advance.

Cheers,
Pari


Re: Using Spark on Hive with Hive also using Spark as its execution engine

2016-05-24 Thread Mich Talebzadeh
Hi,

We use Hive as the database and Spark as an all-purpose query tool.

Whether Hive is the right database for the purpose, or whether one is better
off with something like Phoenix on HBase, the answer is that it depends and
your mileage will vary.

So fit for purpose.

Ideally, one wants to use the fastest method to get the results. How fast is
bounded by our SLA agreements in production, and that keeps us from
unnecessary further work, since we technologists like to play around.

So in short, we use Spark most of the time and use Hive as the backend
engine for data storage, mainly ORC tables.
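
As a minimal sketch of that split (Spark for querying, Hive ORC tables for
storage), assuming a Spark 1.6 HiveContext and a hypothetical table name in the
oraclehadoop database:

// sqlContext is assumed to be a HiveContext so the Hive metastore is visible to Spark.
val sales = sqlContext.table("oraclehadoop.sales_orc")   // an ORC-backed Hive table

// Query with Spark; the data stays stored as ORC in Hive.
sales.groupBy("region").count().show()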

We use Hive on Spark, and with Hive 2 on Spark 1.3.1 we have a combination
that works for now. Granted, it would help to use Hive 2 on Spark 1.6.1, but
at the moment that is one of my ongoing projects.

We do not use any vendor's distribution, as that lets us avoid being tied
down to yet another vendor after years of dependency on SAP, Oracle and MS.
Besides, there is some politics going on, with one vendor promoting Tez and
another promoting Spark as the backend. That is fine, but we obviously prefer
to make an independent assessment ourselves.

My gut feeling is that one needs to look at the use case. Recently we had to
import a very large table from Oracle to Hive and decided to use Spark 1.6.1
with Hive 2 on Spark 1.3.1, and that worked fine. We simply used a JDBC
connection with a temp table and it was good. We could have used Sqoop, but
decided to settle for Spark, so it all depends on the use case.
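
For illustration, a minimal sketch of that JDBC-plus-temp-table approach on
Spark 1.6 with a HiveContext; the connection details, table and database names
below are hypothetical:

// Read the Oracle table over JDBC into a DataFrame (sqlContext is a HiveContext).
val oracleDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//oracle-host:1521/ORCL")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("dbtable", "SCOTT.BIG_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .load()

// Expose it as a temp table and persist it into an ORC-backed Hive table.
oracleDF.registerTempTable("big_table_staging")
sqlContext.sql(
  "CREATE TABLE IF NOT EXISTS oraclehadoop.big_table STORED AS ORC " +
  "AS SELECT * FROM big_table_staging")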

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 24 May 2016 at 03:11, ayan guha  wrote:

> Hi
>
> Thanks for the very useful stats.
>
> Do you have any benchmark for using Spark as the backend engine for Hive vs
> using the Spark Thrift Server (and running Spark code for Hive queries)? We
> are using the latter, but it would be very useful to remove the Thrift
> Server if we can.
>
> On Tue, May 24, 2016 at 9:51 AM, Jörn Franke  wrote:
>
>>
>> Hi Mich,
>>
>> I think these comparisons are useful. One interesting aspect could be
>> hardware scalability in this context, as well as different types of
>> computations. Furthermore, one could compare Spark and Tez+LLAP as
>> execution engines. I have the gut feeling that each one can be justified
>> by different use cases.
>> Nevertheless, there should always be a disclaimer for such comparisons,
>> because Spark and Hive are not good for a lot of concurrent lookups of
>> single rows. They are also not good for frequently writing small amounts
>> of data (e.g. sensor data). Here HBase could be more interesting. Other
>> use cases can justify graph databases, such as Titan, or text analytics /
>> data matching using Solr on Hadoop.
>> Finally, even if you have a lot of data, you need to consider whether you
>> always have to process everything. For instance, I have found valid use
>> cases in practice where we decided to evaluate 10 machine learning models
>> in parallel on only a sample of the data and then evaluate only the
>> "winning" model on the full dataset.
>>
>> As always it depends :)
>>
>> Best regards
>>
>> P.S.: at least Hortonworks ships Spark 1.5 with Hive 1.2 and Spark 1.6 with
>> Hive 1.2 in their distribution. Maybe they have described somewhere how they
>> manage to bring the two together. You may also check Apache Bigtop (a
>> vendor-neutral distribution) to see how they managed to bring both together.
>>
>> On 23 May 2016, at 01:42, Mich Talebzadeh 
>> wrote:
>>
>> Hi,
>>
>>
>>
>> I have done a number of extensive tests using Spark-shell with Hive DB
>> and ORC tables.
>>
>>
>>
>> Now, one issue that we typically face is, and I quote:
>>
>> "Spark is fast as it uses memory and DAG. Great, but when we save data it
>> is not fast enough."
>>
>> OK, but there is a solution now. If you use Spark with Hive and you are on
>> a decent version of Hive (>= 0.14), then you can also deploy Spark as the
>> execution engine for Hive. That will make your application run pretty fast,
>> as you no longer rely on the old MapReduce engine for Hive. In a nutshell,
>> you gain speed in both querying and storage.
>>
>>
>>
>> I have made some comparisons on this set-up and I am sure some of you
>> will find it useful.
>>
>>
>>
>> The version of Spark I use for Spark queries (Spark as a query tool) is 1.6.
>>
>> The version of Hive I use is Hive 2.
>>
>> The version of Spark I use as the Hive execution engine is 1.3.1. It works,
>> and frankly Spark 1.3.1 as an execution engine is adequate (until we sort
>> out the Hadoop libraries mismatch).
>>
>>
>>
>> As an example, I am using the Hive on Spark engine to find the min and max
>> of the IDs for a table with 1 billion rows:
>>
>>
>>
>> 0: jdbc:hive2://rhes564:10010/default>  select min(id), max(id),avg(id),
>> stddev(id) from oraclehadoop.dummy;
>>
>> Query ID = 

Spark JOIN Not working

2016-05-24 Thread Aakash Basu
Hi experts,

I'm extremely new to the Spark ecosystem, so I need some help from you.
I'm fetching data from CSV files and querying them with joins. When I cache
the data using registerTempTable and then use a SELECT query to select what
I need for the given condition, I get the data. But when I try the same
query using a JOIN, I get zero results.
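
For reference, a minimal sketch of that pattern on Spark 1.6; the file, table
and column names are made up and the spark-csv package is assumed, but it shows
one common first check (normalising the join keys) when a JOIN that mirrors a
working SELECT returns zero rows:

// Load two CSV files as DataFrames (assumes the spark-csv package is on the classpath).
val products = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("products.csv")
val hierarchy = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("product_hierarchy.csv")

products.registerTempTable("products")
hierarchy.registerTempTable("hierarchy")

// Join through SQL; zero rows often means the keys differ in whitespace, case or
// type, so normalising both sides of the join condition is a useful first check.
val joined = sqlContext.sql(
  "SELECT p.*, h.level1 FROM products p " +
  "JOIN hierarchy h ON trim(p.product_id) = trim(h.product_id)")
joined.show()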

Both attached files are mostly the same, with a few differences:
Running_Code.scala uses the SELECT query and
ProductHierarchy_Dimension.scala uses the JOIN queries.

Please help me out with this; I have been stuck for two long days.

Thanks,
Aakash.


ProductHierarchy_Dimension.scala
Description: Binary data


Running_Code.scala
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org