Re: [Error:] viewing Web UI on EMR cluster

2016-09-13 Thread Natu Lauchande
Hi,

I think the Spark UI becomes accessible whenever you launch a Spark app on
the cluster; it should be reachable via the Application Tracker link.
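
A quick way to sanity-check the port from the driver itself (a sketch, run e.g.
in spark-shell on the master; the proxy URL pattern below is the one YARN
itself reports in its logs):

  // Run inside the driver (e.g. spark-shell on the master node).
  // The driver UI defaults to port 4040 and moves up (4041, 4042, ...) if that
  // port is already taken, so the configured value is only a hint.
  val configuredUiPort = sc.getConf.get("spark.ui.port", "4040")
  println(s"Configured driver UI port: $configuredUiPort")

  // On YARN the same UI is normally reached through the ResourceManager proxy:
  //   http://<master>:8088/proxy/<application_id>/
  // which is the Application Tracker / ApplicationMaster link mentioned above.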


Regards,
Natu

On Tue, Sep 13, 2016 at 9:37 AM, Divya Gehlot 
wrote:

> Hi ,
> Thank you all..
> Hurray! I am able to view the Hadoop web UI now @ 8088, and even the Spark
> History Server web UI @ 18080.
> But I am unable to figure out the Spark UI web port ...
> Tried 4044 and 4040 ...
> getting the error below:
> This site can’t be reached
> How can I find out the Spark port?
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>
>
> On 13 September 2016 at 15:09, Divya Gehlot 
> wrote:
>
>> Hi,
>> Thanks all for your prompt response.
>> I followed the instruction in the docs EMR SSH tunnel
>> 
>> shared by Jonathan.
>> I am on a Mac and set up FoxyProxy in my Chrome browser
>>
>> Divyas-MacBook-Pro:.ssh divyag$ ssh  -N -D 8157
>> had...@ec2-xx-xxx-xxx-xx.ap-southeast-1.compute.amazonaws.com
>>
>> channel 3: open failed: connect failed: Connection refused
>> channel 3: open failed: connect failed: Connection refused
>> channel 4: open failed: connect failed: Connection refused
>> channel 3: open failed: connect failed: Connection refused
>> channel 4: open failed: connect failed: Connection refused
>> channel 3: open failed: connect failed: Connection refused
>> channel 3: open failed: connect failed: Connection refused
>> channel 4: open failed: connect failed: Connection refused
>> channel 5: open failed: connect failed: Connection refused
>> channel 22: open failed: connect failed: Connection refused
>> channel 23: open failed: connect failed: Connection refused
>> channel 22: open failed: connect failed: Connection refused
>> channel 23: open failed: connect failed: Connection refused
>> channel 22: open failed: connect failed: Connection refused
>> channel 8: open failed: administratively prohibited: open failed
>>
>> What am I missing now ?
>>
>>
>> Thanks,
>>
>> Divya
>>
>> On 13 September 2016 at 14:23, Jonathan Kelly 
>> wrote:
>>
>>> I would not recommend opening port 50070 on your cluster, as that would
>>> give the entire world access to your data on HDFS. Instead, you should
>>> follow the instructions found here to create a secure tunnel to the
>>> cluster, through which you can proxy requests to the UIs using a browser
>>> plugin like FoxyProxy:
>>> https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-ssh-tunnel.html
>>>
>>> ~ Jonathan
>>>
>>> On Mon, Sep 12, 2016 at 10:40 PM Mohammad Tariq 
>>> wrote:
>>>
 Hi Divya,

 Do you have inbound rules enabled on port 50070 of your NN machine?
 Also, it's a good idea to have the public DNS in your /etc/hosts for proper
 name resolution.


 Tariq, Mohammad
 about.me/mti

 On Tue, Sep 13, 2016 at 9:28 AM, Divya Gehlot 
 wrote:

> Hi,
> I am on EMR 4.7 with Spark 1.6.1   and Hadoop 2.7.2
> When I am trying to view any of the cluster's web UIs, either Hadoop's or
> Spark's, I am getting the error below:
> "This site can’t be reached"
> Has anybody used EMR and been able to view the web UIs?
> Could you please share the steps.
>
> Would really appreciate the help.
>
> Thanks,
> Divya
>


>>
>


Can Spark Streaming checkpoint only metadata ?

2016-06-21 Thread Natu Lauchande
Hi,

I wonder if it is possible to checkpoint only the metadata, and not the data
in the RDDs and DataFrames.
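
My understanding so far, as a rough sketch (app name and paths are made up):
ssc.checkpoint(...) enables metadata checkpointing of the configuration,
DStream graph and pending batches, while RDD data is only checkpointed in
addition when a stateful DStream is used or dstream.checkpoint(interval) is
called explicitly.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("metadata-checkpoint-sketch")
  val ssc = new StreamingContext(conf, Seconds(120))
  ssc.checkpoint("s3n://some-bucket/checkpoints")   // metadata checkpointing

  // No stateful operations and no dstream.checkpoint(...) here, so only
  // metadata should end up in the checkpoint directory.
  val lines = ssc.textFileStream("s3n://some-bucket/input")
  lines.count().print()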

Thanks,
Natu


Spark not using all the cluster instances in AWS EMR

2016-06-18 Thread Natu Lauchande
Hi,

I am running some Spark workloads and I notice that only one of the machines
in the cluster is being used (instead of the 3 available).

Is there any parameter that can be set to force it to use the whole cluster?

I am using AWS EMR with YARN.
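
For reference, a sketch of the settings I am looking at (the values are only
illustrative, not what I have deployed):

  import org.apache.spark.SparkConf

  // On YARN, executors are only requested up to these settings, so a small
  // default can leave worker nodes idle.
  val conf = new SparkConf()
    .setAppName("use-all-nodes-sketch")
    .set("spark.executor.instances", "3")   // or --num-executors on spark-submit
    .set("spark.executor.cores", "4")       // or --executor-cores
    .set("spark.executor.memory", "4g")     // or --executor-memory
  // Alternatively, spark.dynamicAllocation.enabled=true lets YARN scale the
  // executor count with the load.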


Thanks,
Natu


RE: difference between dataframe and dataframwrite

2016-06-16 Thread Natu Lauchande
Hi

Does anyone know which one AWS EMR uses by default?

Thanks,
Natu
On Jun 16, 2016 5:12 PM, "David Newberger" 
wrote:

> DataFrame is a collection of data which is organized into named columns.
>
> DataFrame.write is an interface for saving the contents of a DataFrame to
> external storage.
>
>
>
> Hope this helps
>
>
>
> *David Newberger*
>
>
>
>
>
> *From:* pseudo oduesp [mailto:pseudo20...@gmail.com]
> *Sent:* Thursday, June 16, 2016 9:43 AM
> *To:* user@spark.apache.org
> *Subject:* difference between dataframe and dataframwrite
>
>
>
> hi,
>
>
>
> what is the difference between a DataFrame and DataFrame.write?
>
>
>
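
To make the distinction above concrete, a small sketch (assumes a Spark 1.x
sqlContext; paths and column names are made up):

  // DataFrame: named columns and lazy transformations
  val df = sqlContext.read.json("s3n://some-bucket/events.json")
  val filtered = df.select("userId", "ts").filter("ts > 0")

  // DataFrameWriter: the interface returned by df.write, used to persist the result
  filtered.write.mode("overwrite").parquet("s3n://some-bucket/events_parquet")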


Re: concat spark dataframes

2016-06-15 Thread Natu Lauchande
Hi,

You can select the common columns and then use DataFrame.unionAll.
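
Something like this, as a minimal sketch (given two DataFrames df1 and df2;
the column names are made up, and in Spark 1.6 unionAll needs both sides to
have the same schema, hence the select):

  import org.apache.spark.sql.functions.col

  val common = Seq("id", "value")
  val combined = df1.select(common.map(col): _*)
    .unionAll(df2.select(common.map(col): _*))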

Regards,
Natu

On Wed, Jun 15, 2016 at 8:57 PM, spR  wrote:

> hi,
>
> how to concatenate spark dataframes? I have 2 frames with certain columns.
> I want to get a dataframe with columns from both the other frames.
>
> Regards,
> Misha
>


Re: Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi,

It seems to me that the checkpoint command is not persisting the
SparkContext's Hadoop configuration correctly. Could that be a possibility?
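
For reference, the shape of what I have (bucket, folder and key names are
placeholders, and the credentials come from environment variables):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val checkpointDir = "s3n://bucketfoo/checkpoints"

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(new SparkConf().setAppName("s3-recovery"), Seconds(120))
    // These settings are only applied when this factory runs, i.e. on a fresh
    // start. On recovery, getOrCreate rebuilds the context from the checkpoint
    // and does not call this function, which would explain the missing
    // fs.s3n.* settings at restore time.
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
    ssc.checkpoint(checkpointDir)
    ssc.textFileStream("s3n://bucketfoo/filefoo").count().print()
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()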



Thanks,
Natu

On Mon, Jun 13, 2016 at 11:57 AM, Natu Lauchande <nlaucha...@gmail.com>
wrote:

> Hi,
>
> I am testing disaster recovery from Spark and having some issues when
> trying to restore an input file from s3 :
>
> 2016-06-13 11:42:52,420 [main] INFO
> org.apache.spark.streaming.dstream.FileInputDStream$FileInputDStreamCheckpointData
> - Restoring files for time 146581086 ms -
> [s3n://bucketfoo/filefoo/908966c353654a21bc7b2733d65b7c19_availability_146390040.csv]
> Exception in thread "main" java.lang.IllegalArgumentException: AWS Access
> Key ID and Secret Access Key must be specified as the username or password
> (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or
> fs.s3n.awsSecretAccessKey properties (respectively).
>
>
> I am basically following the pattern in
> https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
> and added the environment variables on stream creation.
>
> Can anyone on the list help me figure out why I am getting this error?
>
> Thanks,
> Natu
>
>
>


Spark Streaming checkpoint and restoring files from S3

2016-06-13 Thread Natu Lauchande
Hi,

I am testing disaster recovery from Spark and having some issues when
trying to restore an input file from s3 :

2016-06-13 11:42:52,420 [main] INFO
org.apache.spark.streaming.dstream.FileInputDStream$FileInputDStreamCheckpointData
- Restoring files for time 146581086 ms -
[s3n://bucketfoo/filefoo/908966c353654a21bc7b2733d65b7c19_availability_146390040.csv]
Exception in thread "main" java.lang.IllegalArgumentException: AWS Access
Key ID and Secret Access Key must be specified as the username or password
(respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or
fs.s3n.awsSecretAccessKey properties (respectively).


I am basically following the pattern in
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
and added the environment variables on stream creation.

Can anyone on the list help me figure out why I am getting this error?

Thanks,
Natu


Issues when using the streaming checkpoint

2016-06-09 Thread Natu Lauchande
Hi,

I am getting the following error when using checkpointing in a Spark
Streaming app:

java.io.NotSerializableException: DStream checkpointing has been enabled
but the DStreams with their functions are not serializable

I am following the example available in
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala
with Spark 1.6 .

I wonder if anyone can share their experience handling this kind of error
when using the checkpointing feature of Spark Streaming.
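
For context, a common way to hit this exception (the names here are made up)
is a DStream closure that drags in the enclosing, non-serializable object; a
sketch of the pattern and the usual workaround:

  import org.apache.spark.streaming.StreamingContext

  class Pipeline(ssc: StreamingContext) {   // Pipeline itself is not Serializable
    val prefix: String = "event:"

    def build(): Unit = {
      val lines = ssc.textFileStream("s3n://some-bucket/in")

      // Problematic: referring to the field `prefix` captures `this` (the whole
      // Pipeline) in the closure, and with checkpointing enabled Spark has to
      // serialize the DStream graph together with its functions.
      // lines.map(line => prefix + line).print()

      // Workaround: copy the needed value into a local val so only the String
      // is captured by the closure.
      val p = prefix
      lines.map(line => p + line).print()
    }
  }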

Thanks,
Natu


Re: My notes on Spark Performance & Tuning Guide

2016-05-17 Thread Natu Lauchande
Hi Mich,

I am also interested in the write up.

Regards,
Natu

On Thu, May 12, 2016 at 12:08 PM, Mich Talebzadeh  wrote:

> Hi All,
>
>
> Following the threads in the Spark forum, I decided to write up on the
> configuration of Spark, including allocation of resources, configuration of
> driver, executors and threads, execution of Spark apps, and general
> troubleshooting, taking into account the allocation of resources for Spark
> applications and the OS tools at our disposal.
>
> Since the most widespread configuration I notice is "Spark Standalone
> Mode", I have decided to write these notes starting with Standalone and
> later on moving to YARN:
>
> - *Standalone* – a simple cluster manager included with Spark that makes it
>   easy to set up a cluster.
> - *YARN* – the resource manager in Hadoop 2.
>
>
> I would appreciate if anyone interested in reading and commenting to get
> in touch with me directly on mich.talebza...@gmail.com so I can send the
> write-up for their review and comments.
>
>
> Just to be clear this is not meant to be any commercial proposition or
> anything like that. As I seem to get involved with members troubleshooting
> issues and threads on this topic, I thought it is worthwhile writing a note
> about it to summarise the findings for the benefit of the community.
>
>
> Regards.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: Init/Setup worker

2016-05-10 Thread Natu Lauchande
Hi,

Not sure if this might be helpful to you :
https://github.com/ondra-m/ruby-spark .
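
For the per-worker initialization itself, the usual Spark-side idiom (a rough
sketch, not JRuby-specific; it assumes the JRuby script engine is on the
executor classpath and `lines` stands for any RDD[String]) is a lazy singleton
that each executor JVM initializes on first use:

  import javax.script.{ScriptEngine, ScriptEngineManager}

  object RubyRuntime {
    // Initialized once per executor JVM, the first time a task touches it.
    // A ThreadLocal wrapper could be used instead if per-thread state is needed.
    lazy val engine: ScriptEngine =
      new ScriptEngineManager().getEngineByName("jruby")
  }

  val results = lines.mapPartitions { iter =>
    val engine = RubyRuntime.engine        // created lazily, reused across tasks
    iter.map(record => String.valueOf(engine.eval(record)))
  }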

Regards,
Natu

On Tue, May 10, 2016 at 4:37 PM, Lionel PERRIN 
wrote:

> Hello,
>
>
>
> I’m looking for a solution to use JRuby on top of Spark. The only tricky
> point is that I need every worker thread to have a Ruby interpreter
> initialized. Basically, I need to register a function to be called when each
> worker thread is created: a thread-local variable must be set for the Ruby
> interpreter so that Ruby objects can be deserialized.
>
>
>
> Is there any way to set up the worker threads before any Spark call is made
> on them?
>
>
> Regards,
>
>
> Lionel
>


Re: DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi David,

Thanks for your answer.

I have a follow-up question:

I am using textFileStream and listening on an S3 bucket for new files to
process. Files are created every 5 minutes and my batch interval is 2 minutes.

Does that mean each file will end up in its own RDD?
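
A quick check I will try to confirm this (the path is a placeholder): each
batch interval yields exactly one RDD, so with 2-minute batches and files
landing roughly every 5 minutes, some batches should simply be empty.

  val stream = ssc.textFileStream("s3n://some-bucket/incoming")
  stream.foreachRDD { (rdd, time) =>
    println(s"batch $time -> ${rdd.partitions.length} partitions, empty = ${rdd.isEmpty()}")
  }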

Thanks,
Natu

On Tue, Apr 12, 2016 at 7:46 PM, David Newberger <
david.newber...@wandcorp.com> wrote:

> Hi,
>
>
>
> Time is usually the criterion, if I’m understanding your question. An RDD is
> created for each batch interval. If your interval is 500ms then an RDD is
> created every 500ms; if it’s 2 seconds then an RDD is created every 2 seconds.
>
>
>
> Cheers,
>
>
>
> *David*
>
>
>
> *From:* Natu Lauchande [mailto:nlaucha...@gmail.com]
> *Sent:* Tuesday, April 12, 2016 7:09 AM
> *To:* user@spark.apache.org
> *Subject:* DStream how many RDD's are created by batch
>
>
>
> Hi,
>
> What determines the number of RDDs created for each micro-batch iteration?
>
>
>
> Thanks,
>
> Natu
>


DStream how many RDD's are created by batch

2016-04-12 Thread Natu Lauchande
Hi,

What determines the number of RDDs created for each micro-batch iteration?

Thanks,
Natu


Can I have a hive context and sql context in the same app?

2016-04-12 Thread Natu Lauchande
Hi,

Is it possible to have both a SQLContext and a HiveContext in the same
application?

If yes, would there be any performance penalties in doing so?
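
A minimal sketch of what I mean (Spark 1.x, assuming an existing SparkContext `sc`):

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.sql.hive.HiveContext

  // Both can be built from the same SparkContext. HiveContext extends
  // SQLContext, so a single HiveContext can often serve both purposes.
  val sqlContext  = new SQLContext(sc)
  val hiveContext = new HiveContext(sc)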

Regards,
Natu


Re: Unable to run Spark in YARN mode

2016-04-09 Thread Natu Lauchande
How are you trying to run Spark? Locally? With spark-submit?

On Sat, Apr 9, 2016 at 7:57 AM, maheshmath  wrote:

> I have set SPARK_LOCAL_IP=127.0.0.1 still getting below error
>
> 16/04/09 10:36:50 INFO spark.SecurityManager: Changing view acls to: mahesh
> 16/04/09 10:36:50 INFO spark.SecurityManager: Changing modify acls to:
> mahesh
> 16/04/09 10:36:50 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mahesh); users with modify permissions: Set(mahesh)
> 16/04/09 10:36:51 INFO util.Utils: Successfully started service
> 'sparkDriver' on port 43948.
> 16/04/09 10:36:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
> 16/04/09 10:36:51 INFO Remoting: Starting remoting
> 16/04/09 10:36:52 INFO Remoting: Remoting started; listening on addresses
> :[akka.tcp://sparkDriverActorSystem@127.0.0.1:32792]
> 16/04/09 10:36:52 INFO util.Utils: Successfully started service
> 'sparkDriverActorSystem' on port 32792.
> 16/04/09 10:36:52 INFO spark.SparkEnv: Registering MapOutputTracker
> 16/04/09 10:36:52 INFO spark.SparkEnv: Registering BlockManagerMaster
> 16/04/09 10:36:52 INFO storage.DiskBlockManager: Created local directory at
> /tmp/blockmgr-a2079037-6bbe-49ce-ba78-d475e38ad362
> 16/04/09 10:36:52 INFO storage.MemoryStore: MemoryStore started with
> capacity 517.4 MB
> 16/04/09 10:36:52 INFO spark.SparkEnv: Registering OutputCommitCoordinator
> 16/04/09 10:36:53 INFO server.Server: jetty-8.y.z-SNAPSHOT
> 16/04/09 10:36:53 INFO server.AbstractConnector: Started
> SelectChannelConnector@0.0.0.0:4040
> 16/04/09 10:36:53 INFO util.Utils: Successfully started service 'SparkUI'
> on
> port 4040.
> 16/04/09 10:36:53 INFO ui.SparkUI: Started SparkUI at
> http://127.0.0.1:4040
> 16/04/09 10:36:53 INFO client.RMProxy: Connecting to ResourceManager at
> /0.0.0.0:8032
> 16/04/09 10:36:54 INFO yarn.Client: Requesting a new application from
> cluster with 1 NodeManagers
> 16/04/09 10:36:54 INFO yarn.Client: Verifying our application has not
> requested more than the maximum memory capability of the cluster (8192 MB
> per container)
> 16/04/09 10:36:54 INFO yarn.Client: Will allocate AM container, with 896 MB
> memory including 384 MB overhead
> 16/04/09 10:36:54 INFO yarn.Client: Setting up container launch context for
> our AM
> 16/04/09 10:36:54 INFO yarn.Client: Setting up the launch environment for
> our AM container
> 16/04/09 10:36:54 INFO yarn.Client: Preparing resources for our AM
> container
> 16/04/09 10:36:56 INFO yarn.Client: Uploading resource
>
> file:/home/mahesh/Programs/spark-1.6.1-bin-hadoop2.6/lib/spark-assembly-1.6.1-hadoop2.6.0.jar
> ->
>
> hdfs://localhost:54310/user/mahesh/.sparkStaging/application_1460137661144_0003/spark-assembly-1.6.1-hadoop2.6.0.jar
> 16/04/09 10:36:59 INFO yarn.Client: Uploading resource
>
> file:/tmp/spark-f28e3fd5-4dcd-4199-b298-c7fc607dedb4/__spark_conf__5551799952710555772.zip
> ->
>
> hdfs://localhost:54310/user/mahesh/.sparkStaging/application_1460137661144_0003/__spark_conf__5551799952710555772.zip
> 16/04/09 10:36:59 INFO spark.SecurityManager: Changing view acls to: mahesh
> 16/04/09 10:36:59 INFO spark.SecurityManager: Changing modify acls to:
> mahesh
> 16/04/09 10:36:59 INFO spark.SecurityManager: SecurityManager:
> authentication disabled; ui acls disabled; users with view permissions:
> Set(mahesh); users with modify permissions: Set(mahesh)
> 16/04/09 10:36:59 INFO yarn.Client: Submitting application 3 to
> ResourceManager
> 16/04/09 10:36:59 INFO impl.YarnClientImpl: Submitted application
> application_1460137661144_0003
> 16/04/09 10:37:00 INFO yarn.Client: Application report for
> application_1460137661144_0003 (state: ACCEPTED)
> 16/04/09 10:37:00 INFO yarn.Client:
>  client token: N/A
>  diagnostics: N/A
>  ApplicationMaster host: N/A
>  ApplicationMaster RPC port: -1
>  queue: default
>  start time: 1460178419692
>  final status: UNDEFINED
>  tracking URL:
> http://gubbi:8088/proxy/application_1460137661144_0003/
>  user: mahesh
> 16/04/09 10:37:01 INFO yarn.Client: Application report for
> application_1460137661144_0003 (state: ACCEPTED)
> 16/04/09 10:37:02 INFO yarn.Client: Application report for
> application_1460137661144_0003 (state: ACCEPTED)
> 16/04/09 10:37:03 INFO yarn.Client: Application report for
> application_1460137661144_0003 (state: ACCEPTED)
> 16/04/09 10:37:04 INFO yarn.Client: Application report for
> application_1460137661144_0003 (state: ACCEPTED)
> 16/04/09 10:37:05 INFO cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
> ApplicationMaster registered as NettyRpcEndpointRef(null)
> 16/04/09 10:37:05 INFO cluster.YarnClientSchedulerBackend: Add WebUI
> Filter.
> org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS
> -> gubbi, PROXY_URI_BASES ->
> http://gubbi:8088/proxy/application_1460137661144_0003),
> /proxy/application_1460137661144_0003
> 16/04/09 

Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
> Do you know if textFileStream can see if new files are created underneath a
> whole bucket?

Only at the level of the folder that you specify; it does not look into
subfolders. So your approach would detect everything created directly under a
path such as s3://bucket/path/2016040902_data.csv.

> Also, will Spark Streaming not pick up these files again on the following
> run knowing that it already picked them up or do we have to store state
> somewhere, like the last run date and time to compare against?

Yes, it does that automatically. It will only pick up newly created files once
the streaming app is running.


Thanks,
Natu


On Sat, Apr 9, 2016 at 4:44 PM, Benjamin Kim <bbuil...@gmail.com> wrote:

> Natu,
>
> Do you know if textFileStream can see if new files are created underneath
> a whole bucket? For example, if the bucket name is incoming and new files
> underneath it are 2016/04/09/00/00/01/data.csv and
> 2016/04/09/00/00/02/data/csv, will these files be picked up? Also, will
> Spark Streaming not pick up these files again on the following run knowing
> that it already picked them up or do we have to store state somewhere, like
> the last run date and time to compare against?
>
> Thanks,
> Ben
>
> On Apr 8, 2016, at 9:15 PM, Natu Lauchande <nlaucha...@gmail.com> wrote:
>
> Hi Benjamin,
>
> I have done it . The critical configuration items are the ones below :
>
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
> AccessKeyId)
>
> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
> AWSSecretAccessKey)
>
>   val inputS3Stream =  ssc.textFileStream("s3://example_bucket/folder
> ")
>
> This code will probe for new S3 files created in your every batch interval.
>
> Thanks,
> Natu
>
> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>
>> Has anyone monitored an S3 bucket or directory using Spark Streaming and
>> pulled any new files to process? If so, can you provide basic Scala coding
>> help on this?
>>
>> Thanks,
>> Ben
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>


Re: Use only latest values

2016-04-09 Thread Natu Lauchande
I don't see this happening without a store. You could try Parquet on top of
HDFS; that would at least avoid the burden of a third-party system.
On 09 Apr 2016 9:04 AM, "Daniela S"  wrote:

> Hi,
>
> I would like to cache values and to use only the latest "valid" values to
> build a sum.
> In more detail, I receive values from devices periodically. I would like
> to add up all the valid values each minute. But not every device sends a
> new value every minute. And as long as there is no new value the old one
> should be used for the sum. As soon as I receive a new value from a device
> I would like to overwrite the old value and to use the new one for the sum.
> Would that be possible with Spark Streaming only? Or would I need a kind of
> distributed cache, like Redis? I also need to group the sums per region.
> Should that be done before I store the values in the cache or afterwards?
>
> Thank you in advance.
>
> Regards,
> Daniela
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Monitoring S3 Bucket with Spark Streaming

2016-04-09 Thread Natu Lauchande
Can you elaborate a bit more on your approach using S3 notifications? Just
curious; I'm dealing with a similar issue right now that might benefit from
this.
On 09 Apr 2016 9:25 AM, "Nezih Yigitbasi" <nyigitb...@netflix.com> wrote:

> While it is doable in Spark, S3 also supports notifications:
> http://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
>
>
> On Fri, Apr 8, 2016 at 9:15 PM Natu Lauchande <nlaucha...@gmail.com>
> wrote:
>
>> Hi Benjamin,
>>
>> I have done it . The critical configuration items are the ones below :
>>
>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
>> "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>>   ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId",
>> AccessKeyId)
>>
>> ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey",
>> AWSSecretAccessKey)
>>
>>   val inputS3Stream =
>> ssc.textFileStream("s3://example_bucket/folder")
>>
>> This code will probe for new S3 files created in your every batch
>> interval.
>>
>> Thanks,
>> Natu
>>
>> On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim <bbuil...@gmail.com> wrote:
>>
>>> Has anyone monitored an S3 bucket or directory using Spark Streaming and
>>> pulled any new files to process? If so, can you provide basic Scala coding
>>> help on this?
>>>
>>> Thanks,
>>> Ben
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>


Re: Monitoring S3 Bucket with Spark Streaming

2016-04-08 Thread Natu Lauchande
Hi Benjamin,

I have done it . The critical configuration items are the ones below :

  // Use the native S3 filesystem and pass the credentials explicitly:
  ssc.sparkContext.hadoopConfiguration.set("fs.s3n.impl",
    "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
  ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", AccessKeyId)
  ssc.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", AWSSecretAccessKey)

  // Watch the folder; only files created after the app starts are picked up:
  val inputS3Stream = ssc.textFileStream("s3://example_bucket/folder")

This code will probe for new S3 files created in every batch interval.

Thanks,
Natu

On Fri, Apr 8, 2016 at 9:14 PM, Benjamin Kim  wrote:

> Has anyone monitored an S3 bucket or directory using Spark Streaming and
> pulled any new files to process? If so, can you provide basic Scala coding
> help on this?
>
> Thanks,
> Ben
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Develop locally with Yarn

2016-04-07 Thread Natu Lauchande
Hi,

I am working on a Spark Streaming app. When running locally I use "local[*]"
as the master of my Spark Streaming context.

I wonder what would be needed to develop locally but run it on YARN from the
IDE (I am using IntelliJ IDEA).
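
An untested sketch of what I have in mind (`RUN_LOCALLY` is a hypothetical
flag I would set myself; "yarn-client" assumes HADOOP_CONF_DIR / YARN_CONF_DIR,
or the cluster's XML configs on the classpath, point at the cluster):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val runLocally = sys.env.get("RUN_LOCALLY").exists(_ == "true")
  val conf = new SparkConf()
    .setAppName("streaming-dev")
    .setMaster(if (runLocally) "local[*]" else "yarn-client")
  val ssc = new StreamingContext(conf, Seconds(120))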

Thanks,
Natu


Question around spark on EMR

2016-04-05 Thread Natu Lauchande
Hi,

I am setting up a Scala Spark Streaming app in EMR. I wonder if anyone on the
list can help me with the following questions:

1. What approach have you been using to pass environment variables needed by
the Spark application to an EMR job step?

2. Can I have multiple streaming apps in EMR?

3. Is there any tool recommended for configuration management (something like
Consul)?


Thanks,
Natu


Question Spark streaming - S3 textFileStream- How to get the current file name ?

2016-04-01 Thread Natu Lauchande
Hi,

I am using Spark Streaming with the input strategy of watching for files in
S3 directories, using the textFileStream method on the streaming context.

The filename contains information that is relevant to my pipeline's
processing; I wonder if there is a more robust way to get this name other than
capturing RDD debug information and parsing the logs.

Thanks,
Natu


Re: Programmatically create RDDs based on input

2015-10-31 Thread Natu Lauchande
Hi Amit,

I don't see any default constructor in the JavaRDD docs
https://spark.apache.org/docs/latest/api/java/org/apache/spark/api/java/JavaRDD.html.

Have you tried collecting them in a list instead (a generic array of JavaRDD
won't compile)? Something like:

List<JavaRDD<String>> rdds = new ArrayList<>();
rdds.add(jsc.textFile("/file1.txt"));
rdds.add(jsc.textFile("/file2.txt"));
// ...

Natu


On Sat, Oct 31, 2015 at 11:18 PM, ayan guha  wrote:

> My java knowledge is limited, but you may try with a hashmap and put RDDs
> in it?
>
> On Sun, Nov 1, 2015 at 4:34 AM, amit tewari 
> wrote:
>
>> Thanks Ayan, that's something similar to what I am looking at, but trying
>> the same in Java gives a compile error:
>>
>> JavaRDD jRDD[] = new JavaRDD[3];
>>
>> //Error: Cannot create a generic array of JavaRDD
>>
>> Thanks
>> Amit
>>
>>
>>
>> On Sat, Oct 31, 2015 at 5:46 PM, ayan guha  wrote:
>>
>>> Corrected a typo...
>>>
>>> # In Driver
>>> fileList=["/file1.txt","/file2.txt"]
>>> rdds = []
>>> for f in fileList:
>>>  rdd = jsc.textFile(f)
>>>  rdds.append(rdd)
>>>
>>>
>>> On Sat, Oct 31, 2015 at 11:14 PM, ayan guha  wrote:
>>>
 Yes, this can be done. quick python equivalent:

 # In Driver
 fileList=["/file1.txt","/file2.txt"]
 rdd = []
 for f in fileList:
  rdd = jsc.textFile(f)
  rdds.append(rdd)



 On Sat, Oct 31, 2015 at 11:09 PM, amit tewari 
 wrote:

> Hi
>
> I need the ability to create RDDs programmatically inside my
> program (e.g. based on a variable number of input files).
>
> Can this be done?
>
> I need this as I want to run the following statement inside an
> iteration:
>
> JavaRDD rdd1 = jsc.textFile("/file1.txt");
>
> Thanks
> Amit
>



 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>


Re: How to lookup by a key in an RDD

2015-10-31 Thread Natu Lauchande
Hi,

Looking here for the lookup function might help you:

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions
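
A small sketch of what that looks like (the path, delimiter and key are made
up; once the RDD is a pair RDD, lookup(key) returns the Seq of values stored
under that key):

  val pairs = sc.textFile("hdfs:///data/records.tsv")
    .map(_.split("\t"))
    .map(fields => (fields(0), fields(1)))

  val valuesForKey: Seq[String] = pairs.lookup("some-key")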

Natu

On Sat, Oct 31, 2015 at 6:04 PM, swetha  wrote:

> Hi,
>
> I have a requirement wherein I have to load data from hdfs, build an RDD
> and
> then lookup by key to do some updates to the value and then save it back to
> hdfs. How to lookup for a value using a key in an RDD?
>
>
> Thanks,
> Swetha
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-lookup-by-a-key-in-an-RDD-tp25243.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Cache in Spark

2015-10-09 Thread Natu Lauchande
I don't think so.

Spark does not keep the results in memory unless you tell it to.

You have to explicitly call the cache method on your RDD:
linesWithSpark.cache()

Thanks,
Natu




On Fri, Oct 9, 2015 at 10:47 AM, vinod kumar 
wrote:

> Hi Guys,
>
> May I know whether cache is enabled in spark by default?
>
> Thanks,
> Vinod
>


Re: Networking issues with Spark on EC2

2015-09-25 Thread Natu Lauchande
Hi,

Are you using EMR ?

Natu

On Sat, Sep 26, 2015 at 6:55 AM, SURAJ SHETH  wrote:

> Hi Ankur,
> Thanks for the reply.
> This is already done.
> If I wait for a long amount of time(10 minutes), a few tasks get
> successful even on slave nodes. Sometime, a fraction of the tasks(20%) are
> completed on all the machines in the initial 5 seconds and then, it slows
> down drastically.
>
> Thanks and Regards,
> Suraj Sheth
>
> On Fri, Sep 25, 2015 at 2:10 AM, Ankur Srivastava <
> ankur.srivast...@gmail.com> wrote:
>
>> Hi Suraj,
>>
>> Spark uses a lot of ports to communicate between nodes. Probably your
>> security group is restrictive and does not allow instances to communicate
>> on all networks. The easiest way to resolve it is to add a Rule to allow
>> all Inbound traffic on all ports (0-65535) to instances in same security
>> group like this.
>>
>> All TCP | TCP | 0 - 65535 | <your security group>
>>
>> Hope this helps!!
>>
>> Thanks
>> Ankur
>>
>> On Thu, Sep 24, 2015 at 7:09 AM SURAJ SHETH  wrote:
>>
>>> Hi,
>>>
>>> I am using Spark 1.2 and facing network related issues while performing
>>> simple computations.
>>>
>>> This is a custom cluster set up using ec2 machines and spark prebuilt
>>> binary from apache site. The problem is only when we have workers on other
>>> machines(networking involved). Having a single node for the master and the
>>> slave works correctly.
>>>
>>> The error log from slave node is attached below. It is reading textFile
>>> from local FS(copied each node) and counting it. The first 30 tasks get
>>> completed within 5 seconds. Then, it takes several minutes to complete
>>> another 10 tasks and eventually dies.
>>>
>>> Sometimes, one of the workers completes all the tasks assigned to it.
>>> Different workers have different behavior at different
>>> times(non-deterministic).
>>>
>>> Is it related to something specific to EC2?
>>>
>>>
>>>
>>> 15/09/24 13:04:40 INFO Executor: Running task 117.0 in stage 0.0 (TID
>>> 117)
>>>
>>> 15/09/24 13:04:41 INFO TorrentBroadcast: Started reading broadcast
>>> variable 1
>>>
>>> 15/09/24 13:04:41 INFO SendingConnection: Initiating connection to
>>> [master_ip:56305]
>>>
>>> 15/09/24 13:04:41 INFO SendingConnection: Connected to
>>> [master_ip/master_ip_address:56305], 1 messages pending
>>>
>>> 15/09/24 13:05:41 INFO TorrentBroadcast: Started reading broadcast
>>> variable 1
>>>
>>> 15/09/24 13:05:41 ERROR Executor: Exception in task 77.0 in stage 0.0
>>> (TID 77)
>>>
>>> java.io.IOException: sendMessageReliably failed because ack was not
>>> received within 60 sec
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>
>>> at scala.Option.foreach(Option.scala:236)
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> 15/09/24 13:05:41 INFO CoarseGrainedExecutorBackend: Got assigned task
>>> 122
>>>
>>> 15/09/24 13:05:41 INFO Executor: Running task 3.1 in stage 0.0 (TID 122)
>>>
>>> 15/09/24 13:06:41 ERROR Executor: Exception in task 113.0 in stage 0.0
>>> (TID 113)
>>>
>>> java.io.IOException: sendMessageReliably failed because ack was not
>>> received within 60 sec
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:918)
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13$$anonfun$run$19.apply(ConnectionManager.scala:917)
>>>
>>> at scala.Option.foreach(Option.scala:236)
>>>
>>> at
>>> org.apache.spark.network.nio.ConnectionManager$$anon$13.run(ConnectionManager.scala:917)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:581)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:656)
>>>
>>> at
>>> io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:367)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> 15/09/24 13:06:41 INFO TorrentBroadcast: Started reading broadcast
>>> variable 1
>>>
>>> 15/09/24 13:06:41 INFO SendingConnection: Initiating connection to
>>> [master_ip/master_ip_address:44427]
>>>
>>> 15/09/24 13:06:41 INFO SendingConnection: Connected to
>>> [master_ip/master_ip_address:44427], 1 messages pending
>>>
>>> 15/09/24 13:07:41 ERROR Executor: Exception in task 37.0 in stage 0.0
>>> (TID 

Re: why is spark + scala code so slow, compared to python?

2014-12-11 Thread Natu Lauchande
Are you using Scala in a distributed environment or in standalone mode?

Natu

On Thu, Dec 11, 2014 at 8:23 PM, ll duy.huynh@gmail.com wrote:

 hi.. i'm converting some of my machine learning python code into scala +
 spark.  i haven't been able to run it on large dataset yet, but on small
 datasets (like http://yann.lecun.com/exdb/mnist/), my spark + scala code
 is
 much slower than my python code (5 to 10 times slower than python)

 i already tried everything to improve my spark + scala code like
 broadcasting variables, caching the RDD, replacing all my matrix/vector
 operations with breeze/blas, etc.  i saw some improvements, but it's still
 a
 lot slower than my python code.

 why is that?

 how do you improve your spark + scala performance today?

 or is spark + scala just not the right tool for small to medium datasets?

 when would you use spark + scala vs. python?

 thanks!



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/why-is-spark-scala-code-so-slow-compared-to-python-tp20636.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-25 Thread Natu Lauchande
Fantastic!!! Exactly what I was looking for.


Thanks,
Natu

On Tue, Nov 25, 2014 at 10:46 AM, Sean Owen so...@cloudera.com wrote:

 Yes, and I prepared a basic talk on this exact topic. Slides here:

 http://www.slideshare.net/srowen/anomaly-detection-with-apache-spark-41975155

 This is elaborated in a chapter of an upcoming book that's available
 in early release; you can look at the accompanying source code to get
 some ideas too: https://github.com/sryza/aas/tree/master/kmeans

 On Mon, Nov 24, 2014 at 10:17 PM, Natu Lauchande nlaucha...@gmail.com
 wrote:
  Hi all,
 
  I am getting started with Spark.
 
  I would like to use for a spike on anomaly detection in a massive stream
 of
  metrics.
 
  Can Spark easily handle this use case ?
 
  Thanks,
  Natu



Ideas on how to use Spark for anomaly detection on a stream of data

2014-11-24 Thread Natu Lauchande
Hi all,

I am getting started with Spark.

I would like to use it for a spike on anomaly detection in a massive stream
of metrics.

Can Spark easily handle this use case ?

Thanks,
Natu