New Amazon AMIs for EC2 script

2017-02-23 Thread in4maniac
Hi all, 

I have been using the EC2 script to launch R&D pyspark clusters for a while
now. Because we use a lot of packages such as numpy and scipy with openblas,
scikit-learn, bokeh, vowpal wabbit, pystan and so on, we have all along been
building AMIs on top of the standard spark-ec2 AMIs at
https://github.com/amplab/spark-ec2/tree/branch-1.6/ami-list/us-east-1 

Mainly, I have done the following:
- updated yum
- changed the standard python to python 2.7
- changed pip to 2.7, installed a lot of libraries on top of the existing
AMIs, and saved my own AMIs to avoid having to bootstrap. 

But the standard spark-ec2 AMIs date from early February 2014 and have now
become extremely fragile. For example, when I update a certain library,
ipython breaks, or pip breaks, and so forth. 

Can someone please direct me to a more up-to-date AMI that I can use with
more confidence? I am also interested in what needs to be in the AMI if I
wanted to build one from scratch (last resort :( ).

And isn't it time for a ticket in the Spark project to build a new suite of
AMIs for the EC2 script? https://issues.apache.org/jira/browse/SPARK-922 

Many thanks
in4maniac 






listening to recursive folder structures in s3 using pyspark streaming (textFileStream)

2016-02-17 Thread in4maniac
Hi all, 

I am new to pyspark streaming and I was following a tutorial I found on the
internet
(https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/network_wordcount.py),
but I replaced the data input with an S3 directory path:

lines = ssc.textFileStream("s3n://bucket/first/second/third1/")

When I run the code and upload a file to s3n://bucket/first/second/third1/
(such as s3n://bucket/first/second/third1/test1.txt), the file gets
processed as expected. 

Now I want it to listen to multiple directories and process files when they
are uploaded to any of them, for example:
[s3n://bucket/first/second/third1/, s3n://bucket/first/second/third2/ and
s3n://bucket/first/second/third3/]

I tried to use a wildcard pattern, similar to what sc.textFile accepts: 

lines = ssc.textFileStream("s3n://bucket/first/second/*/")

But this didn't work. Can someone please explain how I could achieve my
objective? 
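
In case it helps frame the question, the workaround I have been considering
is to create one stream per directory and union them. This is only an
untested sketch (I don't know how well union behaves with file streams, and
the app name and batch interval below are made up):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="multi-dir-stream")
ssc = StreamingContext(sc, 60)  # 60-second batches

paths = ["s3n://bucket/first/second/third1/",
         "s3n://bucket/first/second/third2/",
         "s3n://bucket/first/second/third3/"]
streams = [ssc.textFileStream(p) for p in paths]
lines = ssc.union(*streams)  # one DStream covering all three directories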

thanks in advance !!!

in4maniac







Re: Loading file content based on offsets into the memory

2015-05-10 Thread in4maniac
As far as I know, that is not possible. If the file is too big to load onto
one node, what I would do instead is load the file into distributed memory
as an RDD and then map/filter it down to the lines that are relevant to
me. 
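
A rough sketch of what I mean (the path and the predicate are hypothetical):

rdd = sc.textFile("hdfs:///path/to/big_file.txt")  # the file is split across the cluster
relevant = rdd.filter(lambda line: "some marker" in line)  # keep only the lines you need
sample = relevant.take(10)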

I am not sure how to read just part of a single file, though. Sorry I'm
unable to help more here :( 

-in4






Re: AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-08 Thread in4maniac
Hi guys, I realised that it was a bug in my own code that caused it to
break: I was running the filter on a SchemaRDD when I was supposed to be
running it on an RDD. 
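
Roughly the change that fixed it (schematic only; the real path, field
indexes and variable names are different):

hashes_bc = sc.broadcast(set_of_hashes)  # set_of_hashes is a placeholder
rows = sqlContext.parquetFile("s3n://bucket/path/file.parquet")  # SchemaRDD
# before (broken): the filter was applied directly to the SchemaRDD
# after (working): map the rows to plain tuples first, then filter that RDD
rdd = rows.map(lambda row: (row[0], row[1]))
filtered = rdd.filter(lambda rec: rec[0] in hashes_bc.value)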

But I still don't understand why the stderr was about an S3 request rather
than a type-checking error such as "no tuple position 0 found in Row type".
The error was somewhat misleading, which is why I overlooked this logical
error in my code. 

Just thought I should keep the thread posted. 

-in4






Re: Loading file content based on offsets into the memory

2015-05-07 Thread in4maniac
When loading multiple files, Spark loads each file as a partition (block).
You can run a function on each partition with rdd.mapPartitions(function). 

I think you can write a function that skips everything up to the offset and
use it with mapPartitions to extract the relevant lines from each file; a
rough sketch is below. 
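
Something along these lines (untested; it assumes each file ends up as
exactly one partition and that the offset is a line count rather than a
byte offset):

OFFSET = 100  # hypothetical number of lines to skip at the start of each file

def skip_to_offset(lines):
    # 'lines' is an iterator over one partition, i.e. roughly one file
    for i, line in enumerate(lines):
        if i >= OFFSET:
            yield line

rdd = sc.textFile("hdfs:///path/to/files/*")   # placeholder path
relevant = rdd.mapPartitions(skip_to_offset)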

Hope this helps








Re: Spark 1.3.1 and Parquet Partitions

2015-05-07 Thread in4maniac
Hi V, 

I am assuming that each of the three .parquet paths you mentioned has
multiple partitions in it. 

For example: [/dataset/city=London/data.parquet/part-r-0.parquet,
/dataset/city=London/data.parquet/part-r-1.parquet]

I haven't personally used this with HDFS, but I've worked with a similar
file structure containing '=' on S3. 

The way I get around this is by building a string of all the file paths
separated by commas (with NO spaces in between), and then using that string
as the path parameter. I think the following adaptation of the S3 access
pattern to HDFS would work:

If I want to load 1 file:
sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet")

If I want to load multiple files (let's say all 3 of them):
sqlcontext.parquetFile("hdfs://some ip:8029/dataset/city=London/data.parquet,hdfs://some ip:8029/dataset/city=NewYork/data.parquet,hdfs://some ip:8029/dataset/city=Paris/data.parquet")

*** But in the multiple-file scenario, the schema of all the files should be
the same.
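
If the list of paths is long, it can help to build the comma-separated
string programmatically. A rough sketch (the namenode host/port and the
city list are placeholders):

cities = ["London", "NewYork", "Paris"]
paths = ["hdfs://namenode:8029/dataset/city=%s/data.parquet" % c for c in cities]
df = sqlcontext.parquetFile(",".join(paths))  # one comma-separated string, no spaces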

I hope this S3 pattern also works for you with HDFS!

Thanks
in4






AWS-Credentials fails with org.apache.hadoop.fs.s3.S3Exception: FORBIDDEN

2015-05-07 Thread in4maniac
Hi Guys, 

I think this problem is related to : 
http://apache-spark-user-list.1001560.n3.nabble.com/AWS-Credentials-for-private-S3-reads-td8689.html

I am running pyspark 1.2.1 in AWS with my AWS credentials exported to the
master node as environment variables.

Halfway through my application, the following exception is thrown:
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 HEAD request failed for "file path" - ResponseCode=403,
ResponseMessage=Forbidden

Here is some important information about my job: 
+ my AWS credentials are exported to the master node as environment variables
+ there are no '/'s in my secret key
+ the earlier steps that use this parquet file complete successfully
+ the step before the count() does the following (a rough sketch is given
below the list):
   + reads the parquet file (SELECT statement)
   + maps it to an RDD
   + runs a filter on the RDD
+ the filter works as follows:
   + extracts one field from each RDD line
   + checks a list of 40,000 hashes for presence (if field in
LIST_OF_HASHES.value)
   + LIST_OF_HASHES is a broadcast object
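
Schematically, the step before the count() looks like this (the path, query,
field indexes and variable names are simplified placeholders, not my real
code):

LIST_OF_HASHES = sc.broadcast(set_of_40k_hashes)  # set_of_40k_hashes is a placeholder
parquet_rows = sqlContext.parquetFile("s3n://bucket/path/file.parquet")  # placeholder path
parquet_rows.registerTempTable("parquet_table")
rows = sqlContext.sql("SELECT some_field, other_field FROM parquet_table")  # SchemaRDD
rdd = rows.map(lambda row: (row[0], row[1]))       # map rows to plain tuples
negativeObs = rdd.filter(lambda rec: rec[0] in LIST_OF_HASHES.value)  # broadcast membership check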

The weirdness is that I use this parquet file in earlier steps and it works
fine. The other confusing part is that it only starts failing halfway
through the stage: it completes a fraction of the tasks and then starts
failing.  

Hoping to hear something positive. Many thanks in advance

Sahanbull

The stack trace is as follows:
>>> negativeObs.count()
[Stage 9:==>   (161 + 240) / 800]

15/05/07 07:55:59 ERROR TaskSetManager: Task 277 in stage 9.0 failed 4
times; aborting job
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/spark/python/pyspark/rdd.py", line 829, in count
return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/root/spark/python/pyspark/rdd.py", line 820, in sum
return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
  File "/root/spark/python/pyspark/rdd.py", line 725, in reduce
vals = self.mapPartitions(func).collect()
  File "/root/spark/python/pyspark/rdd.py", line 686, in collect
bytesInJava = self._jrdd.collect().iterator()
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
line 538, in __call__
  File "/root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line
300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o139.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task
277 in stage 9.0 failed 4 times, most recent failure: Lost task 277.3 in
stage 9.0 (TID 4832, ip-172-31-1-185.ec2.internal):
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException:
S3 HEAD request failed for
'/subbucket%2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2F2Fpath%2Ffilename.parquet%2Fpart-r-349.parquet'
- ResponseCode=403, ResponseMessage=Forbidden
at
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:122)
at sun.reflect.GeneratedMethodAccessor116.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at org.apache.hadoop.fs.s3native.$Proxy9.retrieveMetadata(Unknown
Source)
at
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:326)
at
parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:381)
at
parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:155)
at
parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:107)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:69)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:120)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
at
org.apache.spark.api.python.