Re: AWS Credentials for private S3 reads

2014-07-02 Thread Brian Gawalt
Huh: not scrubbing the slashes fixed it. I would have sworn I tried that, got a 403 Forbidden, and then remembered the slash prescription. I can confirm I was never scrubbing the actual URIs. It looks like it'd all be working now, except it's smacking its head against: 14/07/02 23:37:38 INFO rdd.HadoopRDD:
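
For reference, a minimal sketch (not part of the original message) of the alternative that sidesteps the slash question entirely: hand the keys to the S3N filesystem through the Hadoop configuration instead of embedding them in the URI. The bucket and path below are placeholders.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PrivateS3Read"))

    // Supply the keys via Hadoop configuration, so slashes in the secret key
    // never need to be escaped inside an s3n:// URI.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    val lines = sc.textFile("s3n://my-private-bucket/some/prefix/part-*")  // placeholder path
    println(lines.count())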

AWS Credentials for private S3 reads

2014-07-02 Thread Brian Gawalt
Hello everyone, I'm having some difficulty reading from my company's private S3 buckets. I've got an S3 access key and secret key, and I can read the files fine from a non-Spark Scala routine via AWScala. But trying to read them with the SparkContext.textFiles([comma separated s3n://bucket/key u
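
A hedged sketch (not from the message itself) of the call pattern being described, with placeholder keys, bucket, and object names; textFile accepts a comma-separated list of paths:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("PrivateS3Read"))

    // Placeholder credentials; real secret keys may contain '/' characters,
    // which is what the rest of this thread is about.
    val accessKey = sys.env("AWS_ACCESS_KEY_ID")
    val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

    val paths = Seq(
      s"s3n://$accessKey:$secretKey@my-bucket/logs/part-00000",
      s"s3n://$accessKey:$secretKey@my-bucket/logs/part-00001"
    ).mkString(",")

    val data = sc.textFile(paths)  // one RDD over all listed objects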

Re: Selecting first ten values in a RDD/partition

2014-05-29 Thread Brian Gawalt
Try looking at the .mapPartitions( ) method implemented for RDD[T] objects. It will give you direct access to an iterator containing the member objects of each partition for doing the kind of within-partition hashtag counts you're describing.
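
A minimal sketch of that suggestion (names are illustrative, not from the thread), assuming the data is an RDD[String] of tweets:

    import org.apache.spark.rdd.RDD

    // Take the first ten records of each partition, without collecting anything.
    def firstTenPerPartition(tweets: RDD[String]): RDD[String] =
      tweets.mapPartitions(iter => iter.take(10))

    // Count hashtags within each partition; emits (tag, count) pairs per partition.
    def hashtagCountsPerPartition(tweets: RDD[String]): RDD[(String, Int)] =
      tweets.mapPartitions { iter =>
        val counts = scala.collection.mutable.Map.empty[String, Int].withDefaultValue(0)
        for (tweet <- iter; token <- tweet.split("\\s+") if token.startsWith("#"))
          counts(token) += 1
        counts.iterator
      }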

Re: Understanding epsilon in KMeans

2014-05-16 Thread Brian Gawalt
Hi Stuti, I think you're right. The epsilon parameter is indeed used as a threshold for deciding when KMeans has converged. If you look at line 201 of mllib's KMeans.scala: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L201 you ca
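
A small sketch of what that check amounts to (paraphrased with illustrative names, not the MLlib source itself): a center only counts as having moved if it shifted by more than epsilon, and the run is declared converged once no center moves.

    // epsilon defaults to 1e-4 in MLlib's KMeans; smaller values demand tighter convergence.
    def squaredDistance(a: Array[Double], b: Array[Double]): Double =
      a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

    def converged(oldCenters: Seq[Array[Double]],
                  newCenters: Seq[Array[Double]],
                  epsilon: Double = 1e-4): Boolean =
      oldCenters.zip(newCenters).forall { case (oldC, newC) =>
        squaredDistance(oldC, newC) <= epsilon * epsilon
      }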

Re: Using String Dataset for Logistic Regression

2014-05-16 Thread Brian Gawalt
Pravesh, Correct: the logistic regression engine is set up to perform classification tasks that take feature vectors (arrays of real-valued numbers), each paired with a class label, and to learn a linear combination of those features that divides the classes. As the above commenters have mentioned, th
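
A hedged sketch (not from the thread; it assumes an existing SparkContext named sc and a made-up categorical column) of turning string-valued features into numeric vectors before training:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Hypothetical vocabulary for one categorical string column.
    val categories = Seq("red", "green", "blue")
    def oneHot(value: String): Array[Double] =
      categories.map(c => if (c == value) 1.0 else 0.0).toArray

    // Toy (feature, label) pairs; labels must already be numeric (0.0 / 1.0).
    val raw = sc.parallelize(Seq(("red", 1.0), ("blue", 0.0), ("green", 1.0)))
    val training = raw.map { case (color, label) =>
      LabeledPoint(label, Vectors.dense(oneHot(color)))
    }

    val model = LogisticRegressionWithSGD.train(training, 100)  // 100 SGD iterations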

Re: accessing partition i+1 from mapper of partition i

2014-05-16 Thread Brian Gawalt
I don't think there's a direct way of bleeding elements across partitions. But you could write it yourself relatively succinctly: A) Sort the RDD. B) Look at the sorted RDD's partitions with the .mapPartitionsWithIndex() method. Map each partition to its partition ID and its maximum element. Col
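
A minimal sketch of that recipe (illustrative names; the truncated message presumably continues by collecting and broadcasting the per-partition maxima so each partition can see its neighbor's boundary value):

    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def withPreviousPartitionMax(sc: SparkContext, data: RDD[Int]): RDD[(Int, Option[Int])] = {
      val sorted = data.sortBy(identity)

      // (partitionId, max element of that partition); empty partitions are skipped.
      val maxPerPartition = sorted
        .mapPartitionsWithIndex { (id, iter) =>
          if (iter.hasNext) Iterator((id, iter.max)) else Iterator.empty
        }
        .collect()
        .toMap
      val boundary = sc.broadcast(maxPerPartition)

      // Pair every element with the previous partition's maximum, i.e. the value
      // sitting just before this partition in sorted order (None for partition 0).
      sorted.mapPartitionsWithIndex { (id, iter) =>
        val prevMax = boundary.value.get(id - 1)
        iter.map(x => (x, prevMax))
      }
    }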