. if the pipeline object is null.) This seems reasonable to
me. I will try it on an actual cluster next
Thanks,
Philip
On 10/22/2013 11:50 AM, Philip Ogren wrote:
I have a text analytics pipeline that performs a sequence of steps
(e.g. tokenization, part-of-speech tagging, etc
My team is investigating a number of technologies in the Big Data
space. A team member recently got turned on to Cascading
http://www.cascading.org/about-cascading/ as an application layer for
orchestrating complex workflows/scenarios. He asked me if Spark had an
application layer. My
Hi Arun,
I had recent success getting a Spark project set up in Eclipse Juno.
Here are the notes that I wrote down for the rest of my team, which you
may find useful:
Spark version 0.8.0 requires Scala version 2.9.3. This is a bit
inconvenient because Scala is now on version 2.10.3
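For anyone reproducing this setup, a minimal sbt build along the lines below pins the matching versions. This is a sketch only; the artifact coordinates and the Akka resolver are assumptions based on how Spark 0.8.0 was published and may need adjusting for your environment.

// build.sbt -- sketch; coordinates assumed from Spark 0.8.0's published artifacts
name := "my-spark-project"

scalaVersion := "2.9.3"  // Spark 0.8.0 is built against Scala 2.9.3

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

libraryDependencies += "org.apache.spark" % "spark-core_2.9.3" % "0.8.0-incubating"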
On the front page http://spark.incubator.apache.org/ of the Spark
website there is the following simple word count implementation:
file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
The same code can be found in the Quick Start
for third-party
apps.
Matei
On Nov 7, 2013, at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote:
I remember running into something very similar when trying to perform
a foreach on java.util.List and I fixed it by adding the following
import:
import
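(The import statement itself is cut off in the snippet above. For what it's worth, the usual way to get foreach on a java.util.List in Scala 2.9/2.10 is the JavaConversions wrapper; the lines below are a sketch of that assumption, not necessarily the exact import Philip used.)

import scala.collection.JavaConversions._  // implicit wrappers for Java collections

val javaList: java.util.List[String] = new java.util.ArrayList[String]()
javaList.add("hello")
javaList.add("world")
javaList.foreach(println)  // foreach now resolves via the implicit conversion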
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up
how many times each column is populated in an RDD. Here is the code I
came up with:
//RDD of List[String] corresponding to tab delimited values
val columns = spark.textFile("myfile.tsv").map(line
can
collect at the end.
- Patrick
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote:
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up how
many times each column is populated in an RDD. Here is the code I came up
with:
//RDD
an ID for the column (maybe its index) and a flag for
whether it's present.
Then you reduce by key to get the per-column count. Then you can
collect at the end.
- Patrick
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com
wrote:
Hi Spark coders,
I wrote my first little Spark
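Patrick's approach above -- emit a (column index, populated flag) pair for every column of every row and then reduce by key -- might look roughly like the sketch below. The file name, the tab delimiter, and treating an empty string as "not populated" are assumptions, since Philip's original code is truncated in the archive.

// Count, for each column, how many rows have a non-empty value in it.
// "myfile.tsv" is a placeholder path; input is assumed to be tab-delimited.
import org.apache.spark.SparkContext._    // for reduceByKey when not in the shell

val rows = spark.textFile("myfile.tsv").map(_.split("\t", -1))

val columnCounts = rows
  .flatMap { cols =>
    cols.zipWithIndex.map { case (value, index) =>
      (index, if (value.nonEmpty) 1 else 0)  // column index -> populated flag
    }
  }
  .reduceByKey(_ + _)                        // per-column populated count
  .collect()                                 // small result, so collecting is fine

columnCounts.sortBy(_._1).foreach { case (index, count) =>
  println("column " + index + ": " + count)
}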
Hao,
If you have worked out the code and turned it into an example that you can
share, then please do! This task is in my queue of things to do so any
helpful details that you uncovered would be most appreciated.
Thanks,
Philip
On 11/13/2013 5:30 AM, Hao REN wrote:
Ok, I worked it out.
Hi Spark community,
I learned a lot the last time I posted some elementary Spark code here.
So, I thought I would do it again. Someone please tell me offline if
this is noise or an unfair use of the list! I acknowledge that this
borders on asking Scala 101 questions
I have an
Here's a good place to start:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E
On 12/5/2013 10:18 AM, Benjamin Kim wrote:
Does anyone have an example or some sort of starting point code when
I have a simple scenario that I'm struggling to implement. I would like
to take a fairly simple RDD generated from a large log file, perform
some transformations on it, and write the results out such that I can
perform a Hive query either from Hive (via Hue) or Shark. I'm having
troubles
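One possible shape for this, sketched under assumptions (the paths, the stand-in parser, and the table schema are invented for illustration), is to write the transformed RDD out as delimited text and then point a Hive external table at that directory so both Hive (via Hue) and Shark can query it.

// Transform the log RDD and write it as tab-separated text under a directory
// that a Hive external table can point at. Everything named here is hypothetical.
val parseLogLine = (line: String) => line.split(" ")   // stand-in parser

val transformed = spark.textFile("hdfs://myserver:8020/logs/raw")
  .map(parseLogLine)
  .map(fields => fields.mkString("\t"))

transformed.saveAsTextFile("hdfs://myserver:8020/warehouse/mylogs")

// Then, from Hive or Shark, something along these lines:
//   CREATE EXTERNAL TABLE mylogs (host STRING, request STRING, status STRING)
//   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
//   LOCATION '/warehouse/mylogs';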
/11/13 Philip Ogren philip.og...@oracle.com
Hao,
If you have worked out the code and turned it into an example that
you can share, then please do! This task is in my queue of things
to do so any helpful details that you uncovered would be most
://linkedin.com/in/ctnguyen
On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren philip.og...@oracle.com wrote:
I have a simple scenario that I'm struggling to implement. I
would like to take a fairly simple RDD generated from a large log
file, perform some
You might try a more standard Windows path. I typically write to a
local directory such as target/spark-output.
On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote:
We are trying to test out running Spark 0.8.0 on a Windows box, and
while we can get it to run all the examples that don't output
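Concretely, the relative-path suggestion above amounts to something like the following (myRdd is a placeholder name):

// Write to a relative, forward-slash path instead of a Windows drive-letter path.
myRdd.saveAsTextFile("target/spark-output")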
When I call rdd.saveAsTextFile("hdfs://...") it uses my username to
write to the HDFS drive. If I try to write to an HDFS directory that I
do not have permissions to, then I get an error like this:
Permission denied: user=me, access=WRITE,
inode=/user/you/:you:us:drwxr-xr-x
I can obviously
name? When I use
Spark it writes to HDFS as the user that runs the Spark services... I
wish it read and wrote as me.
On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote:
When I call rdd.saveAsTextFile("hdfs://...") it uses my username
Hi Spark Community,
I would like to expose my Spark application/libraries via a web service
in order to launch jobs, interact with users, etc. I'm sure there are
hundreds of ways to think about doing this, each with a variety of
technology stacks that could be applied. So, I know there is no
, you can use the NLineInputFormat, I guess, which is
provided by Hadoop, and pass it as a parameter.
Maybe there are better ways to do it.
Regards,
Suman Bharadwaj S
On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren
philip.og...@oracle.com wrote:
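Suman's NLineInputFormat suggestion above might be wired into Spark's Hadoop input API roughly as follows. The paths, the lines-per-split value, and the configuration key are assumptions; older Hadoop releases use mapred.line.input.format.linespermap instead of the key shown here.

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.mapred.lib.NLineInputFormat

// Ask NLineInputFormat for 1000-line splits, then read the file through the
// SparkContext (called "spark" here, matching the shell).
val conf = new JobConf()
FileInputFormat.setInputPaths(conf, "hdfs://myserver:8020/mydir/input.txt")
conf.setInt("mapreduce.input.lineinputformat.linespermap", 1000)

val lines = spark.hadoopRDD(conf, classOf[NLineInputFormat],
                            classOf[LongWritable], classOf[Text])
                 .map { case (_, text) => text.toString }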
I have a very simple Spark application that looks like the following:
var myRdd: RDD[Array[String]] = initMyRdd()
println(myRdd.first.mkString(", "))
println(myRdd.count)
myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
myRdd.saveAsTextFile("target/mydir/")
The println statements work as
-machine
cluster though -- you may get a bit of data on each machine in that
local directory.
On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote:
I have a very simple Spark application that looks like the following:
var myRdd
this on a multi-machine
cluster though -- you may get a bit of data on each machine in that
local directory.
On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote:
I have a very simple Spark application that looks like the following
?
On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren philip.og...@oracle.com wrote:
I just tried your suggestion and get the same results with the
_temporary directory. Thanks though.
On 1/2/2014 10:28 AM, Andrew Ash wrote:
You want to write
Great question! I was writing up a similar question this morning and
decided to investigate some more before sending. Here's what I'm
trying. I have created a new Scala project that contains only
spark-examples-assembly-0.8.1-incubating.jar and
My problem seems to be related to this:
https://issues.apache.org/jira/browse/MAPREDUCE-4052
So, I will try running my setup from a Linux client and see if I have
better luck.
On 1/15/2014 11:38 AM, Philip Ogren wrote:
Great question! I was writing up a similar question this morning
In my Spark programming thus far my unit of work has been a single row
from an HDFS file by creating an RDD[Array[String]] with something like:
spark.textFile(path).map(_.split("\t"))
Now, I'd like to do some work over a large collection of files in which
the unit of work is a single file
] [content]
Anyone have better ideas?
On 2014-1-31 at 12:18 AM, Philip Ogren philip.og...@oracle.com wrote:
In my Spark programming thus far my unit of work has been
a single row from an hdfs file by creating an
RDD
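One way to make the unit of work a whole file, sketched under assumptions: the path list below is invented, the files must be readable from every worker (e.g. on a shared file system), and for HDFS you would go through the Hadoop FileSystem API instead.

import scala.io.Source

// Distribute the file paths, then read each whole file on the worker that gets it.
val paths = List("/shared/docs/doc1.txt", "/shared/docs/doc2.txt")   // hypothetical

val fileContents = spark.parallelize(paths).map { path =>
  val source = Source.fromFile(path)
  try { (path, source.mkString) } finally { source.close() }
}
// fileContents: RDD[(String, String)] of (file path, whole-file content)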
I have a few questions about yarn-standalone and yarn-client deployment
modes that are described on the Launching Spark on YARN
http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page.
1) Can someone give me a basic conceptual overview? I am struggling
with understanding the