Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread Michael Cutler
Reading files directly from Amazon S3 can be frustrating, especially if you're dealing with a large number of input files. Could you please elaborate on your use case? Does the S3 bucket in question already contain a large number of files? The implementation of the * wildcard operator in S3

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
as failover/retry logic etc. Best of luck! MC *Michael Cutler* Founder, CTO *Mobile: +44 789 990 7847 Email: mich...@tumra.com Web: tumra.com http://tumra.com/?utm_source=signatureutm_medium=email* *Visit us at our offices in Chiswick Park http://goo.gl/maps/abBxq* *Registered

Re: Spark streaming and rate limit

2014-06-19 Thread Michael Cutler
..) is how to limit the external service call rate and manage the incoming buffer size (enqueuing). Could you give me some tips for that? Thanks again, Flavio On Thu, Jun 19, 2014 at 10:19 AM, Michael Cutler mich...@tumra.com wrote: Hello Flavio, It sounds to me like the best solution

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
When you start seriously using Spark in production there are basically two things everyone eventually needs: 1. Scheduled Jobs - recurring hourly/daily/weekly jobs. 2. Always-On Jobs - that require monitoring, restarting etc. There are lots of ways to implement these requirements,

Re: How do you run your spark app?

2014-06-19 Thread Michael Cutler
TAR.GZ direct from HDFS, unpack it and launch the appropriate script. Packaging everything required in one go makes for much cleaner development / testing / deployment than relying on cluster-specific classpath additions or any add-jars functionality. On 19 June 2014 22:53, Michael Cutler

Re: spark streaming, kafka, SPARK_CLASSPATH

2014-06-17 Thread Michael Cutler
and have them managed using Mesos/Marathon http://mesosphere.io/ to handle failures and restarts with long running processes. Good luck! MC

Re: Is There Any Benchmarks Comparing C++ MPI with Spark

2014-06-16 Thread Michael Cutler
Hello Wei, I speak from experience of writing many distributed HPC applications using Open MPI (C/C++) on x86, PowerPC and Cell B.E. processors, and Parallel Virtual Machine (PVM) way before that, back in the '90s. I can say with absolute certainty: *Any gains you believe there are because C++ is

Re: Using Spark to crack passwords

2014-06-12 Thread Michael Cutler
in HDFS. Done right you should be able to achieve interactive (few second) lookups. Have fun! MC

Re: json parsing with json4s

2014-06-11 Thread Michael Cutler
Hello, You're absolutely right: the syntax you're using returns the json4s value objects, not native types like Int, Long etc. Fix that problem and then everything else (filters) will work as you expect. This is a short snippet of a larger example: [1] val lines =

Re: Performance of Akka or TCP Socket input sources vs HDFS: Data locality in Spark Streaming

2014-06-10 Thread Michael Cutler
Hey Nilesh, Great to hear you're using Spark Streaming. In my opinion the crux of your question comes down to what you want to do with the data in the future and/or whether there is utility in using it from more than one Spark/Streaming job. 1). *One-time-use fire and forget* - as you rightly point

Re: How can I make Spark 1.0 saveAsTextFile to overwrite existing file

2014-06-02 Thread Michael Cutler

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
/cotdp/b5b8155bb85e254d2a3c MC

Re: Using Spark to analyze complex JSON

2014-05-22 Thread Michael Cutler
and REGEXP so clearly some of the basics are in there. As the saying goes ... *Use the source, Luke! http://blog.codinghorror.com/learn-to-read-the-source-luke/* :o)

Re: Spark Job Server first steps

2014-05-22 Thread Michael Cutler
Eclipse. Best, Michael

Re: count()-ing gz files gives java.io.IOException: incorrect header check

2014-05-21 Thread Michael Cutler
https://issues.apache.org/jira/browse/HADOOP-8900 and it affects all Hadoop releases prior to 1.2.X MC

Re: facebook data mining with Spark

2014-05-20 Thread Michael Cutler
this on a small sample of data you get results like this: - female: average=114, count=15422 - male: average=104, count=14727. This basically says the average level achieved by women is slightly higher than that achieved by men. Best of luck fishing through Facebook data! MC
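
The per-group "average" and "count" figures quoted in the snippet above can be produced by keeping a running sum and count for each key. The code from the original message is not shown in this listing, so what follows is only a minimal plain-Python sketch of that idea: the function name `average_by_key` and the sample records are hypothetical and are not the data from the thread. In Spark the same sum-and-count pattern would presumably be expressed as a keyed aggregation, but that mapping is an assumption.

```python
from collections import defaultdict


def average_by_key(records):
    """Return {key: (average, count)} for an iterable of (key, value) pairs."""
    totals = defaultdict(lambda: [0, 0])  # key -> [running sum, running count]
    for key, value in records:
        totals[key][0] += value
        totals[key][1] += 1
    # Divide once at the end so each key needs only a single pass over its values.
    return {key: (total / count, count) for key, (total, count) in totals.items()}


# Hypothetical sample records: (gender, level achieved). Not the thread's data.
sample = [("female", 10), ("male", 5), ("female", 20), ("male", 10), ("male", 15)]

for gender, (average, count) in sorted(average_by_key(sample).items()):
    print(f"- {gender}: average={average:.0f}, count={count}")
```

Carrying the sum and the count separately, rather than averaging partial averages, keeps the result correct when the records are split across partitions and merged afterwards.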