RDD[URI]

2014-01-30 Thread Philip Ogren
In my Spark programming thus far my unit of work has been a single row from an hdfs file by creating an RDD[Array[String]] with something like: spark.textFile(path).map(_.split("\t")) Now, I'd like to do some work over a large collection of files in which the unit of work is a single file
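
A minimal sketch of one way to make a whole file the unit of work, assuming a Spark release that provides SparkContext.wholeTextFiles (added in Spark 1.0, after this thread was written); the path is hypothetical:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local", "file-per-record")
    // Each element is a (filePath, fileContents) pair, so one file = one record.
    val files = sc.wholeTextFiles("hdfs://myserver:8020/mydir")
    val lineCounts = files.map { case (path, contents) =>
      (path, contents.split("\n").length) // e.g. count lines per file
    }
    lineCounts.collect().foreach(println)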

Re: RDD[URI]

2014-01-30 Thread Philip Ogren
Anyone have better ideas? At 2014-1-31 12:18 AM, Philip Ogren philip.og...@oracle.com wrote: In my Spark programming thus far my unit of work has been a single row from an hdfs file by creating an RDD

various questions about yarn-standalone vs. yarn-client

2014-01-30 Thread Philip Ogren
I have a few questions about yarn-standalone and yarn-client deployment modes that are described on the Launching Spark on YARN http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page. 1) Can someone give me a basic conceptual overview? I am struggling with understanding the

Re: Anyone know how to submit spark job to yarn in java code?

2014-01-15 Thread Philip Ogren
Great question! I was writing up a similar question this morning and decided to investigate some more before sending. Here's what I'm trying. I have created a new Scala project that contains only spark-examples-assembly-0.8.1-incubating.jar and

Re: Anyone know how to submit spark job to yarn in java code?

2014-01-15 Thread Philip Ogren
My problem seems to be related to this: https://issues.apache.org/jira/browse/MAPREDUCE-4052 So, I will try running my setup from a Linux client and see if I have better luck. On 1/15/2014 11:38 AM, Philip Ogren wrote: Great question! I was writing up a similar question this morning

rdd.saveAsTextFile problem

2014-01-02 Thread Philip Ogren
I have a very simple Spark application that looks like the following: var myRdd: RDD[Array[String]] = initMyRdd() println(myRdd.first.mkString(", ")) println(myRdd.count) myRdd.saveAsTextFile("hdfs://myserver:8020/mydir") myRdd.saveAsTextFile("target/mydir/") The println statements work as

Re: rdd.saveAsTextFile problem

2014-01-02 Thread Philip Ogren
a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following: var myRdd

Re: rdd.saveAsTextFile problem

2014-01-02 Thread Philip Ogren
this on a multi-machine cluster though -- you may get a bit of data on each machine in that local directory. On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote: I have a very simple Spark application that looks like the following

Re: rdd.saveAsTextFile problem

2014-01-02 Thread Philip Ogren
On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren philip.og...@oracle.com wrote: I just tried your suggestion and get the same results with the _temporary directory. Thanks though. On 1/2/2014 10:28 AM, Andrew Ash wrote: You want to write
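
If the goal of the local saveAsTextFile call in this thread was a single file on the driver machine, a minimal sketch of the usual workaround, using the thread's myRdd and assuming it is small enough to fit in driver memory (the output path is hypothetical and its directory must already exist):

    import java.io.PrintWriter

    // Bring the data back to the driver and write it with plain Java IO,
    // instead of saveAsTextFile, which writes per-partition files on
    // whichever machine each task happens to run.
    val writer = new PrintWriter("target/mydir/output.txt")
    try {
      myRdd.map(_.mkString("\t")).collect().foreach(writer.println)
    } finally {
      writer.close()
    }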

Re: multi-line elements

2013-12-24 Thread Philip Ogren
you can use the NLineInputFormat, I guess, which is provided by Hadoop, and pass it as a parameter. Maybe there are better ways to do it. Regards, Suman Bharadwaj S On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren philip.og...@oracle.com wrote
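
A minimal sketch of the NLineInputFormat suggestion, assuming an existing SparkContext named sc and the new Hadoop API; the path and line count are hypothetical. Note this controls how many lines each task receives rather than joining a multi-line record into a single element:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat

    val hadoopConf = new Configuration()
    // Hadoop 2 property name; older Hadoop 1 releases used
    // mapred.line.input.format.linespermap instead.
    hadoopConf.setInt("mapreduce.input.lineinputformat.linespermap", 4)
    val records = sc.newAPIHadoopFile(
      "hdfs://myserver:8020/multiline-records.txt",
      classOf[NLineInputFormat], classOf[LongWritable], classOf[Text], hadoopConf)
    records.map(_._2.toString).take(5).foreach(println)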

Re: writing to HDFS with a given username

2013-12-13 Thread Philip Ogren
name? When I use Spark it writes to HDFS as the user that runs the Spark services... I wish it read and wrote as me. On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote: When I call rdd.saveAsTextFile("hdfs://...") it uses my username

exposing spark through a web service

2013-12-13 Thread Philip Ogren
Hi Spark Community, I would like to expose my Spark application/libraries via a web service in order to launch jobs, interact with users, etc. I'm sure there are hundreds of ways to think about doing this, each with a variety of technology stacks that could be applied. So, I know there is no

writing to HDFS with a given username

2013-12-12 Thread Philip Ogren
When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write to the HDFS drive. If I try to write to an HDFS directory that I do not have permission to, then I get an error like this: Permission denied: user=me, access=WRITE, inode=/user/you/:you:us:drwxr-xr-x I can obviously
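
A minimal sketch of one common workaround, assuming the cluster uses simple (non-Kerberos) authentication; the target user name and path are hypothetical, and rdd is the RDD from the message above:

    // With simple security, the HDFS client trusts the HADOOP_USER_NAME
    // environment variable (or system property) when deciding who you are.
    // It must be set before the first HDFS call, and it is not a
    // substitute for real authentication; with Kerberos it has no effect.
    System.setProperty("HADOOP_USER_NAME", "you")
    rdd.saveAsTextFile("hdfs://myserver:8020/user/you/mydir")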

Re: Fwd: Spark forum question

2013-12-11 Thread Philip Ogren
You might try a more standard Windows path. I typically write to a local directory such as target/spark-output. On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote: We are trying to test out running Spark 0.8.0 on a Windows box, and while we can get it to run all the examples that don't output

Re: Writing an RDD to Hive

2013-12-09 Thread Philip Ogren
http://linkedin.com/in/ctnguyen On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren philip.og...@oracle.com wrote: I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some

Writing an RDD to Hive

2013-12-06 Thread Philip Ogren
I have a simple scenario that I'm struggling to implement. I would like to take a fairly simple RDD generated from a large log file, perform some transformations on it, and write the results out such that I can perform a Hive query either from Hive (via Hue) or Shark. I'm having troubles
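
A minimal sketch of the pattern that usually answers this, assuming plain tab-delimited text output; the input RDD (logLines), paths, table, and columns are all hypothetical:

    // Write the transformed RDD as tab-delimited text into a fixed HDFS directory.
    val transformed = logLines.map(fields => fields.mkString("\t"))
    transformed.saveAsTextFile("hdfs://myserver:8020/data/mylogs")

    // Then, from Hive (via Hue) or Shark, lay an external table over it:
    //   CREATE EXTERNAL TABLE mylogs (host STRING, path STRING, hits INT)
    //   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    //   LOCATION 'hdfs://myserver:8020/data/mylogs';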

Re: write data into HBase via spark

2013-12-06 Thread Philip Ogren
2013/11/13 Philip Ogren philip.og...@oracle.com Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do so any helpful details that you uncovered would be most

Re: Writing to HBase

2013-12-05 Thread Philip Ogren
Here's a good place to start: http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E On 12/5/2013 10:18 AM, Benjamin Kim wrote: Does anyone have an example or some sort of starting point code when

Re: write data into HBase via spark

2013-11-13 Thread Philip Ogren
Hao, If you have worked out the code and turned it into an example that you can share, then please do! This task is in my queue of things to do, so any helpful details that you uncovered would be most appreciated. Thanks, Philip On 11/13/2013 5:30 AM, Hao REN wrote: Ok, I worked it out.
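
Since the worked-out code itself isn't in the archive, here is a minimal sketch of the pattern commonly used at the time (HBase 0.94-era API, old-style org.apache.hadoop.mapred output format; the table name, column family, and sample RDD are hypothetical, and sc is an assumed SparkContext):

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable
    import org.apache.hadoop.hbase.mapred.TableOutputFormat
    import org.apache.hadoop.hbase.util.Bytes
    import org.apache.hadoop.mapred.JobConf
    import org.apache.spark.SparkContext._ // implicits for saveAsHadoopDataset

    val jobConf = new JobConf(HBaseConfiguration.create())
    jobConf.setOutputFormat(classOf[TableOutputFormat])
    jobConf.set(TableOutputFormat.OUTPUT_TABLE, "my_table")
    jobConf.setOutputKeyClass(classOf[ImmutableBytesWritable])
    jobConf.setOutputValueClass(classOf[Put])

    // Convert each (rowKey, value) pair into a Put and write it to HBase.
    val kvRdd = sc.parallelize(Seq(("row1", "hello"), ("row2", "world")))
    kvRdd.map { case (key, value) =>
      val put = new Put(Bytes.toBytes(key))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      (new ImmutableBytesWritable(Bytes.toBytes(key)), put)
    }.saveAsHadoopDataset(jobConf)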

code review - splitting columns

2013-11-13 Thread Philip Ogren
Hi Spark community, I learned a lot the last time I posted some elementary Spark code here. So, I thought I would do it again. Someone politely tell me offline if this is noise or an unfair use of the list! I acknowledge that this borders on asking Scala 101 questions. I have an

code review - counting populated columns

2013-11-08 Thread Philip Ogren
Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with: // RDD of List[String] corresponding to tab-delimited values val columns = spark.textFile("myfile.tsv").map(line

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
can collect at the end. - Patrick On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote: Hi Spark coders, I wrote my first little Spark job that takes columnar data and counts up how many times each column is populated in an RDD. Here is the code I came up with: //RDD

Re: code review - counting populated columns

2013-11-08 Thread Philip Ogren
an ID for the column (maybe its index) and a flag for whether it's present. Then you reduce by key to get the per-column count. Then you can collect at the end. - Patrick On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote: Hi Spark coders, I wrote my first little Spark
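
A minimal sketch of the suggested approach, using the thread's spark context variable and assuming a tab-delimited file (the path is hypothetical) where an empty string means "not populated":

    import org.apache.spark.SparkContext._ // implicits that provide reduceByKey

    // Split with limit -1 so trailing empty columns are preserved.
    val columns = spark.textFile("myfile.tsv").map(_.split("\t", -1))
    val counts = columns
      .flatMap(_.zipWithIndex.map { case (cell, i) =>
        (i, if (cell.nonEmpty) 1 else 0) // (column index, populated flag)
      })
      .reduceByKey(_ + _)
      .collect()
    counts.sortBy(_._1).foreach { case (i, n) =>
      println("column " + i + ": " + n)
    }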

Where is reduceByKey?

2013-11-07 Thread Philip Ogren
On the front page http://spark.incubator.apache.org/ of the Spark website there is the following simple word count implementation: file = spark.textFile("hdfs://...") file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) The same code can be found in the Quick Start

Re: Where is reduceByKey?

2013-11-07 Thread Philip Ogren
for third-party apps. Matei On Nov 7, 2013, at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote: I remember running into something very similar when trying to perform a foreach on java.util.List and I fixed it by adding the following import: import
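
The quoted import above is cut off in the archive, but the fix for the reduceByKey question itself is well documented for that era: the method lives on PairRDDFunctions and only becomes visible after importing the implicit conversions from the SparkContext companion object. A minimal sketch, using the front page's spark variable:

    import org.apache.spark.SparkContext._ // implicit RDD[(K, V)] => PairRDDFunctions

    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _) // now resolves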

compare/contrast Spark with Cascading

2013-10-28 Thread Philip Ogren
My team is investigating a number of technologies in the Big Data space. A team member recently got turned on to Cascading http://www.cascading.org/about-cascading/ as an application layer for orchestrating complex workflows/scenarios. He asked me if Spark had an application layer. My

Re: set up spark in eclipse

2013-10-28 Thread Philip Ogren
Hi Arun, I had recent success getting a Spark project set up in Eclipse Juno. Here are the notes that I wrote down for the rest of my team that you may perhaps find useful: Spark version 0.8.0 requires Scala version 2.9.3. This is a bit inconvenient because Scala is now on version 2.10.3
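
A minimal sketch of a matching build definition, assuming sbt is used to generate the Eclipse project (e.g. via the sbteclipse plugin); the project name is hypothetical, while the Scala pin and artifact coordinates follow the 0.8.0-incubating release described above:

    // build.sbt
    name := "spark-eclipse-example"

    scalaVersion := "2.9.3" // Spark 0.8.0 was built against 2.9.3, not 2.10.x

    libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"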

Re: unable to serialize analytics pipeline

2013-10-22 Thread Philip Ogren
if the pipeline object is null.) This seems reasonable to me. I will try it on an actual cluster next. Thanks, Philip On 10/22/2013 11:50 AM, Philip Ogren wrote: I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc
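
A minimal sketch of the lazy-initialization idea described above, with a hypothetical stand-in for the real pipeline type and a hypothetical documents RDD[String]: the heavyweight, non-serializable pipeline lives in a singleton object, so each worker JVM builds its own copy on first use instead of having it shipped from the driver.

    // Hypothetical stand-in for a heavyweight, non-serializable NLP pipeline.
    class AnalyticsPipeline {
      def process(text: String): String = text.toLowerCase // e.g. tokenize, tag, ...
    }

    object PipelineHolder {
      // Initialized at most once per JVM, on first access from a task;
      // the singleton object itself is never serialized and sent over the wire.
      lazy val pipeline = new AnalyticsPipeline
    }

    val processed = documents.map(doc => PipelineHolder.pipeline.process(doc))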