Spark driver thread

2020-03-05 Thread James Yu
Hi,

Does a Spark driver always run as a single thread?

If so, does that mean asking for more than one vCPU for the driver is wasteful?


Thanks,
James


Re: Can't get Spark to interface with S3A Filesystem with correct credentials

2020-03-05 Thread Devin Boyer
Thanks for the input, Steven and Hariharan. I think this ended up being a
combination of bad configuration of the credential providers I was using
*and* using the wrong set of credentials for the test data I was trying to
access.

I was able to get this working with both Hadoop 2.8 and 3.1 by pulling down
the correct hadoop-aws and aws-java-sdk[-bundle] bundles and fixing the
credential provider I was using for testing. It's probably the same for the
Spark distribution compiled for Hadoop 2.7, but since I already have a
build with a more modern Hadoop version working, I may just stick with that.
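
For anyone hitting the same issue, here is a minimal sketch of this kind of
setup, assuming a Spark build on Hadoop 2.8+ with the matching hadoop-aws and
aws-java-sdk-bundle jars on the classpath (e.g. via --packages). The provider
class and keys below are only an example taken from the Hadoop S3A docs;
adjust them to however you actually supply credentials:

    import org.apache.spark.sql.SparkSession

    // Pick only the credential provider you actually need; a mismatched
    // provider list was part of the problem described above.
    val spark = SparkSession.builder()
      .appName("s3a-smoke-test")
      .master("local[*]")
      .config("spark.hadoop.fs.s3a.aws.credentials.provider",
              "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
      .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    // The public object from the "Integration with Cloud Infrastructures" docs.
    spark.sparkContext.textFile("s3a://landsat-pds/scene_list.gz")
      .take(5).foreach(println)
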

Best,
Devin

On Wed, Mar 4, 2020 at 11:02 PM Hariharan  wrote:

> If you're using hadoop 2.7 or below, you may also need to use the
> following hadoop settings:
>
> fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
> fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
> fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
>
> Hadoop 2.8 and above would have these set by default.
>
> Thanks,
> Hariharan
>
> On Thu, Mar 5, 2020 at 2:41 AM Devin Boyer
>  wrote:
> >
> > Hello,
> >
> > I'm attempting to run Spark within a Docker container with the hope of
> eventually running Spark on Kubernetes. Nearly all the data we currently
> process with Spark is stored in S3, so I need to be able to interface with
> it using the S3A filesystem.
> >
> > I feel like I've gotten close to getting this working but for some
> reason cannot get my local Spark installations to correctly interface with
> S3 yet.
> >
> > A basic example of what I've tried:
> >
> > Build Kubernetes docker images by downloading the
> spark-2.4.5-bin-hadoop2.7.tgz archive and building the
> kubernetes/dockerfiles/spark/Dockerfile image.
> > Run an interactive docker container using the above built image.
> > Within that container, run spark-shell. This command passes valid AWS
> credentials by setting spark.hadoop.fs.s3a.access.key and
> spark.hadoop.fs.s3a.secret.key using --conf flags, and downloads the
> hadoop-aws package by specifying the --packages
> org.apache.hadoop:hadoop-aws:2.7.3 flag.
> > Try to access the simple public file as outlined in the "Integration
> with Cloud Infrastructures" documentation by running:
> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
> > Observe this to fail with a 403 Forbidden exception thrown by S3
> >
> >
> > I've tried a variety of other means of setting credentials (like
> exporting the standard AWS_ACCESS_KEY_ID environment variable before
> launching spark-shell), and other means of building a Spark image and
> including the appropriate libraries (see this Github repo:
> https://github.com/drboyer/spark-s3a-demo), all with the same results.
> I've tried also accessing objects within our AWS account, rather than the
> object from the public landsat-pds bucket, with the same 403 error being
> thrown.
> >
> > Can anyone help explain why I can't seem to connect to S3 successfully
> using Spark, or even explain where I could look for additional clues as to
> what's misconfigured? I've tried turning up the logging verbosity and
> didn't see much that was particularly useful, but happy to share additional
> log output too.
> >
> > Thanks for any help you can provide!
> >
> > Best,
> > Devin Boyer
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
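
As a side note on the fs.s3a settings Hariharan lists above: on a Hadoop
2.7-era build they can also be applied programmatically instead of in
core-site.xml. A small sketch, equivalent to passing each one with
--conf spark.hadoop.<key>=<value>:

    // Only relevant for Hadoop 2.7-era builds, per Hariharan's note above;
    // newer Hadoop versions ship suitable defaults.
    val hc = spark.sparkContext.hadoopConfiguration
    hc.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hc.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hc.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
    hc.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")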


Re: Hostname :BUG

2020-03-05 Thread Zahid Rahman
Talking about copy and paste:
Larry Tesler, the inventor of cut/copy & paste and find & replace,
passed away last week at age 74.

Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org



On Thu, 5 Mar 2020 at 07:01, Zahid Rahman  wrote:

> Please explain why you think that, if your reason is different from this
> one:
>
> If you think so because the header of /etc/hostname says "hosts", that is
> because I copied the file header from /etc/hosts to /etc/hostname.
>
>
>
>
> On Wed, 4 Mar 2020, 21:14 Andrew Melo,  wrote:
>
>> Hello Zahid,
>>
>> On Wed, Mar 4, 2020 at 1:47 PM Zahid Rahman  wrote:
>>
>>> Hi,
>>>
>>> I found the problem was that on my Linux operating system,
>>> /etc/hostname was blank.
>>>
>>> *STEP 1*
>>> I searched Google for the error message and there was an answer
>>> suggesting I should add the following to /etc/hostname:
>>>
>>> 127.0.0.1  [hostname] localhost.
>>>
>>
>> I believe you've confused /etc/hostname and /etc/hosts --
>>
>>
>>>
>>> I did that but there was still an error; this time the Spark log in
>>> standard output was concatenating the text content of /etc/hostname,
>>> like so: 127.0.0.1[hostname]localhost.
>>>
>>> *STEP 2*
>>> My second attempt was to change /etc/hostname to 127.0.0.1.
>>> This time I was getting a warning about "using loopback" rather than an
>>> error.
>>>
>>> *STEP 3*
>>> I wasn't happy with that, so I then changed /etc/hostname to the content
>>> shown below, and the warning message disappeared. My guess is that the
>>> error is triggered when the Spark session is created, in the SparkConf()
>>> API.
>>>
>>>  SparkConf sparkConf = new SparkConf()
>>>  .setAppName("Simple Application")
>>>  .setMaster("local")
>>>  .set("spark.executor.memory","2g");
>>>
>>> $ cat /etc/hostname
>>> # hosts This file describes a number of hostname-to-address
>>> #   mappings for the TCP/IP subsystem.  It is mostly
>>> #   used at boot time, when no name servers are running.
>>> #   On small systems, this file can be used instead of a
>>> #   "named" name server.
>>> # Syntax:
>>> #
>>> # IP-Address  Full-Qualified-Hostname  Short-Hostname
>>> #
>>>
>>> 192.168.0.42
>>>
>>> zahid@localhost
>>> :~/Downloads/apachespark/Apache-Spark-Example/Java-Code-Geek>
>>> mvn exec:java -Dexec.mainClass=com.javacodegeek.examples.SparkExampleRDD
>>> -Dexec.args="input.txt"
>>> [INFO] Scanning for projects...
>>> [WARNING]
>>> [WARNING] Some problems were encountered while building the effective
>>> model for javacodegeek:examples:jar:1.0-SNAPSHOT
>>> [WARNING] 'build.plugins.plugin.version' for
>>> org.apache.maven.plugins:maven-compiler-plugin is missing. @ line 12,
>>> column 21
>>> [WARNING]
>>> [WARNING] It is highly recommended to fix these problems because they
>>> threaten the stability of your build.
>>> [WARNING]
>>> [WARNING] For this reason, future Maven versions might no longer support
>>> building such malformed projects.
>>> [WARNING]
>>> [INFO]
>>> [INFO] ---< javacodegeek:examples >
>>> [INFO] Building examples 1.0-SNAPSHOT
>>> [INFO] [ jar ]-
>>> [INFO]
>>> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ examples ---
>>> WARNING: An illegal reflective access operation has occurred
>>> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform
>>> (file:/home/zahid/.m2/repository/org/apache/spark/spark-unsafe_2.12/2.4.5/spark-unsafe_2.12-2.4.5.jar)
>>> to method java.nio.Bits.unaligned()
>>> WARNING: Please consider reporting this to the maintainers of
>>> org.apache.spark.unsafe.Platform
>>> WARNING: Use --illegal-access=warn to enable warnings of further illegal
>>> reflective access operations
>>> WARNING: All illegal access operations will be denied in a future release
>>> Using Spark's default log4j profile:
>>> org/apache/spark/log4j-defaults.properties
>>> 20/02/29 17:20:40 INFO SparkContext: Running Spark version 2.4.5
>>> 20/02/29 17:20:40 WARN NativeCodeLoader: Unable to load native-hadoop
>>> library for your platform... using builtin-java classes where applicable
>>> 20/02/29 17:20:41 INFO SparkContext: Submitted application: Word Count
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls to: zahid
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls to: zahid
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing view acls groups to:
>>> 20/02/29 17:20:41 INFO SecurityManager: Changing modify acls groups to:
>>> 20/02/29 17:20:41 INFO SecurityManager: SecurityManager: authentication
>>> disabled; ui acls disabled; users  with view permissions: Set(zahid);
>>> groups with view permissions: Set(); users  with modify permissions:
>>> Set(zahid); groups with modify permissions: Set()
>>> 
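
For reference, the distinction Andrew points to above: /etc/hosts holds the
address-to-name mappings (the file whose header was copied), while
/etc/hostname normally contains nothing but the machine's own name. A typical
split looks roughly like this, with "myhost" standing in for whatever the
machine is actually called:

    $ cat /etc/hostname
    myhost

    $ cat /etc/hosts
    127.0.0.1     localhost
    192.168.0.42  myhost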

Re: Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-05 Thread Something Something
Yes, that was it! It seems timeouts only work if input data is continuously
flowing. I had stopped the input job because I had enough data, but timeouts
are only processed while data keeps arriving. Not sure why it's designed that
way; it makes it a bit harder to write unit/integration tests, but I'm sure
there's a reason for the design. Thanks.

On Wed, Mar 4, 2020 at 6:31 PM Tathagata Das 
wrote:

> Make sure that you are continuously feeding data into the query to trigger
> the batches; only then are timeouts processed.
> See the timeout behavior details here -
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.GroupState
>
> On Wed, Mar 4, 2020 at 2:51 PM Something Something <
> mailinglist...@gmail.com> wrote:
>
>> I've set the timeout duration to "2 minutes" as follows:
>>
>> def updateAcrossEvents(tuple3: Tuple3[String, String, String],
>>                        inputs: Iterator[R00tJsonObject],
>>                        oldState: GroupState[MyState]): OutputRow = {
>>
>>   println(" Inside updateAcrossEvents with : " + tuple3._1 + ", " + tuple3._2 + ", " + tuple3._3)
>>   var state: MyState = if (oldState.exists) oldState.get else MyState(tuple3._1, tuple3._2, tuple3._3)
>>
>>   if (oldState.hasTimedOut) {
>>     println("@ oldState has timed out ")
>>     // Logic to write OutputRow
>>     OutputRow("some values here...")
>>   } else {
>>     for (input <- inputs) {
>>       state = updateWithEvent(state, input)
>>       oldState.update(state)
>>       oldState.setTimeoutDuration("2 minutes")
>>     }
>>     OutputRow(null, null, null)
>>   }
>> }
>>
>> I have also specified ProcessingTimeTimeout in 'mapGroupsWithState' as 
>> follows...
>>
>> .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateAcrossEvents)
>>
>> But 'hasTimedOut' is never true so I don't get any output! What am I doing 
>> wrong?
>>
>>
>>
>>
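
To make the "keep data flowing" advice above concrete, below is a
self-contained sketch. The CountState/CountOutput types and the rate source
are just for illustration, not from the original job; because the rate source
never stops, micro-batches keep firing and the ProcessingTimeTimeout callback
actually gets a chance to run:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

    case class CountState(count: Long)
    case class CountOutput(key: Long, count: Long, timedOut: Boolean)

    object TimeoutSketch {
      // Same shape as updateAcrossEvents above: handle the timed-out branch
      // first, otherwise fold the new rows into state and re-arm the timeout.
      def updateCounts(key: Long, rows: Iterator[Long],
                       state: GroupState[CountState]): CountOutput = {
        if (state.hasTimedOut) {
          val last = state.get
          state.remove()
          CountOutput(key, last.count, timedOut = true)
        } else {
          val count = state.getOption.map(_.count).getOrElse(0L) + rows.size
          state.update(CountState(count))
          state.setTimeoutDuration("2 minutes")
          CountOutput(key, count, timedOut = false)
        }
      }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("timeout-sketch").getOrCreate()
        import spark.implicits._

        // The built-in rate source emits rows forever, so batches keep
        // triggering even after the "interesting" input has stopped.
        val query = spark.readStream.format("rate")
          .option("rowsPerSecond", "5").load()
          .select(($"value" % 10).as("key")).as[Long]
          .groupByKey(identity)
          .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout())(updateCounts)
          .writeStream
          .outputMode(OutputMode.Update())
          .format("console")
          .trigger(Trigger.ProcessingTime("10 seconds"))
          .start()

        query.awaitTermination()
      }
    }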


Re: Read Hive ACID Managed table in Spark

2020-03-05 Thread venkata naidu udamala
You can try using the Hive Warehouse Connector:
https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehouseconnector_for_handling_apache_spark_data.html
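
A rough sketch of what HWC usage looks like from Spark, going by the
documentation linked above (class and method names are from the HDP 3.x docs,
and the table name is just a placeholder; the HWC assembly jar and the
spark.sql.hive.hiveserver2.jdbc.url / LLAP settings it requires must be
configured per those docs):

    import com.hortonworks.hwc.HiveWarehouseSession

    // Build an HWC session on top of the existing SparkSession; reads of
    // ACID managed tables go through HiveServer2/LLAP rather than straight
    // off the ORC files.
    val hive = HiveWarehouseSession.session(spark).build()

    val df = hive.executeQuery("SELECT * FROM my_db.my_acid_table")
    df.show(5)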

On Thu, Mar 5, 2020, 6:51 AM Chetan Khatri 
wrote:

> Just following up, in case anyone has worked on this before
>
> On Wed, Mar 4, 2020 at 12:09 PM Chetan Khatri 
> wrote:
>
>> Hi Spark Users,
>> I want to read Hive ACID managed table data (ORC) in Spark. Can someone
>> help me here.
>> I've tried, https://github.com/qubole/spark-acid but no success.
>>
>> Thanks
>>
>


Re: Read Hive ACID Managed table in Spark

2020-03-05 Thread Chetan Khatri
Just following up, in case anyone has worked on this before

On Wed, Mar 4, 2020 at 12:09 PM Chetan Khatri 
wrote:

> Hi Spark Users,
> I want to read Hive ACID managed table data (ORC) in Spark. Can someone
> help me here.
> I've tried, https://github.com/qubole/spark-acid but no success.
>
> Thanks
>


Re: SPARK Suitable IDE

2020-03-05 Thread Nicolas Paris


Holden Karau  writes:
> I work in emacs with ensime.

The ensime project was stopped and the repository archived. Its successor,
"metals", works well for Scala >= 2.12.

Any good resource for setting up ensime with emacs? I can't wait for the
overall Spark community to move to Scala 2.12.

--
nicolas paris

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SPARK Suitable IDE

2020-03-05 Thread Zahid Rahman
There are indications on the Internet that Jupyter Notebook offers an
advantage when working with Spark technologies.

I was wondering whether there is any substance to these claims; if there is,
I would proceed to get comfortable with Jupyter Notebook.

Regards

Zahid

On Thu, 5 Mar 2020, 02:03 Holden Karau,  wrote:

> I work in emacs with ensime. I think really any IDE is ok, so go with the
> one you feel most at home in.
>
> On Wed, Mar 4, 2020 at 5:49 PM tianlangstudio
>  wrote:
>
>> We use IntelliJ IDEA, whether it's Java, Scala or Python.
>>
>> TianlangStudio 
>> Some of the biggest lies: I will start tomorrow/Others are better than
>> me/I am not good enough/I don't have time/This is the way I am
>> 
>>
>>
>> --
>> From: Zahid Rahman
>> Sent: Tuesday, 3 March 2020, 06:43
>> To: user
>> Subject: SPARK Suitable IDE
>>
>> Hi,
>>
>> Can you recommend a suitable IDE for Apache Spark from the list below, or
>> a more suitable one if you know of any?
>>
>> Codeanywhere
>> goormIDE
>> Koding
>> SourceLair
>> ShiftEdit
>> Browxy
>> repl.it
>> PaizaCloud IDE
>> Eclipse Che
>> Visual Studio Online
>> Gitpod
>> Google Cloud Shell
>> Codio
>> Codepen
>> CodeTasty
>> Glitch
>> JSitor
>> ICEcoder
>> Codiad
>> Dirigible
>> Orion
>> Codiva.io
>> Collide
>> Codenvy
>> AWS Cloud9
>> JSFiddle
>> GitLab
>> SLAppForge Sigma
>> Jupyter
>> CoCalc
>>
>> Backbutton.co.uk
>> ¯\_(ツ)_/¯
>> ♡۶Java♡۶RMI ♡۶
>> Make Use Method {MUM}
>> makeuse.org
>> 
>>
>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>