Does anyone know if there are Spark assemblies available for download
that have been built for CDH5 and YARN?
Thanks,
Philip
(myquery))
I'm sure it won't take much imagination to figure out how to do the
matching in a batch way.
If anyone has done anything along these lines I'd love to have some
feedback.
Thanks,
Philip
On 08/04/2014 09:46 AM, Philip Ogren wrote:
This looks like a really cool feature and it seems
It is really nice that Spark RDDs provide functions that are often
equivalent to functions found in Scala collections. For example, I can
call:
myArray.map(myFx)
and equivalently
myRdd.map(myFx)
Awesome!
My question is this. Is it possible to write code that works on either
an RDD or
-parameter-forwarding-possible-in-scala
I'm not seeing a way to utilize implicit conversions in this case. Since Scala
is statically (albeit inferred) typed, I don't see a way around having a common
supertype.
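One workaround sketch (purely illustrative; Mappable is a made-up trait, not a Spark or Scala API): a typeclass that abstracts over map, so the same function body can run on a Seq or an RDD:

import scala.language.higherKinds
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Typeclass abstracting the map operation over a container type C[_].
trait Mappable[C[_]] {
  def map[A, B: ClassTag](c: C[A])(f: A => B): C[B]
}

object Mappable {
  implicit val seqMappable: Mappable[Seq] = new Mappable[Seq] {
    def map[A, B: ClassTag](c: Seq[A])(f: A => B): Seq[B] = c.map(f)
  }
  implicit val rddMappable: Mappable[RDD] = new Mappable[RDD] {
    def map[A, B: ClassTag](c: RDD[A])(f: A => B): RDD[B] = c.map(f)
  }
}

// The same logic now accepts either container:
def lengths[C[_]](xs: C[String])(implicit m: Mappable[C]): C[Int] =
  m.map(xs)(_.length)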
On Monday, July 21, 2014 11:01 AM, Philip Ogren philip.og...@oracle.com wrote:
It is really
Hi Patrick,
This is great news but I nearly missed the announcement because it had
scrolled off the folder view that I have Spark users list messages go
to. 40+ new threads since you sent the email out on Friday evening.
You might consider having someone on your team create a
In various previous versions of Spark (and I believe the current
version, 1.0.0, as well) we have noticed that it does not seem possible
to have both a local SparkContext and a SparkContext connected to a cluster
via either a Spark Cluster (i.e. using the Spark resource manager) or a
YARN cluster.
In my unit tests I have a base class, which all my tests extend, with
setup and teardown methods that they inherit. They look something like this:
var spark: SparkContext = _
@Before
def setUp() {
Thread.sleep(100L) // this seems to give Spark more time to
reset from the
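For reference, a fleshed-out sketch of that base class (the property clearing at the end is a workaround seen with older Spark versions, not an official recipe):

import org.apache.spark.SparkContext
import org.junit.{After, Before}

abstract class SparkTestBase {
  var spark: SparkContext = _

  @Before
  def setUp() {
    Thread.sleep(100L) // give the previous context time to release its ports
    spark = new SparkContext("local", "unit-test")
  }

  @After
  def tearDown() {
    if (spark != null) spark.stop()
    spark = null
    // In older Spark versions the driver port setting can linger; clearing
    // it helps the next local context start cleanly (workaround, not an API).
    System.clearProperty("spark.driver.port")
  }
}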
I asked a question related to Marcelo's answer a few months ago. The
discussion there may be useful:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-URI-td1054.html
On 06/02/2014 06:09 PM, Marcelo Vanzin wrote:
Hi Jamal,
If what you want is to process lots of files in parallel, the
Hi Pierre,
I asked a similar question on this list about 6 weeks ago. Here is one
answer I got that is of particular note:
http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccamjob8n3foaxd-dc5j57-n1oocwxefcg5chljwnut7qnreq...@mail.gmail.com%3E
In the upcoming release of
Have you actually found this to be true? I have found Spark local mode
to be quite good about blowing up if there is something non-serializable
and so my unit tests have been great for detecting this. I have never
seen something that worked in local mode that didn't work on the cluster
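A contrived sketch of the kind of failure local mode does surface (not code from the thread): a closure that captures a non-serializable object fails with a NotSerializableException even with a local master, because task closures get serialized either way:

import org.apache.spark.SparkContext

class Handle // deliberately not Serializable

object SerializationCheck {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "serialization-check")
    val handle = new Handle
    // The closure below captures `handle`, so Spark fails to serialize it
    // and the job blows up in local mode just as it would on a cluster.
    sc.parallelize(1 to 10).map(i => (handle, i).toString).count()
    sc.stop()
  }
}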
Great reference! I just skimmed through the results without reading
much of the methodology - but it looks like Spark outperforms
Stratosphere fairly consistently in the experiments. It's too bad the
data sources only range from 2GB to 8GB. Who knows if the apparent
pattern would extend out
Has there been any thought to adding a tail() method to RDD? It would
be really handy to skip over the first item in an RDD when it contains
header information. Even better would be a drop(int) function that
would allow you to skip over several lines of header information. Our
attempts to
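One workaround sketch (a hand-rolled helper, not an RDD method; it assumes the header lines all fall in partition 0, which holds for a header at the top of a single input file):

import org.apache.spark.rdd.RDD

// Skip the first n lines of an RDD read from a single text file.
def drop(rdd: RDD[String], n: Int): RDD[String] =
  rdd.mapPartitionsWithIndex { (index, iter) =>
    if (index == 0) iter.drop(n) else iter
  }

val noHeader = drop(spark.textFile("myfile.tsv"), 1)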
arbitrary
format and will be deprecated soon. If you find this feature useful,
you can test it out by building the master branch of Spark yourself,
following the instructions in https://github.com/apache/spark/pull/42.
Andrew
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com
directly - I think
it's been factored nicely so it's fairly decoupled from the UI. The
concern is this is a semi-internal piece of functionality and
something we might, e.g. want to change the API of over time.
- Patrick
On Wed, Apr 2, 2014 at 3:39 PM, Philip Ogren philip.og...@oracle.com
to figure out
how to do this or if it is possible.
Any advice is appreciated.
Thanks,
Philip
On 04/01/2014 09:43 AM, Philip Ogren wrote:
Hi DB,
Just wondering if you ever got an answer to your question about
monitoring progress - either offline or through your own
investigation. Any findings
In my Spark programming thus far my unit of work has been a single row
from an HDFS file, by creating an RDD[Array[String]] with something like:
spark.textFile(path).map(_.split("\t"))
Now, I'd like to do some work over a large collection of files in which
the unit of work is a single file
] [content]
Anyone have better ideas ?
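For what it's worth, Spark 1.0 adds SparkContext.wholeTextFiles, which makes a single file the unit of work directly; a minimal sketch (the path is illustrative):

import org.apache.spark.SparkContext._

// Each record is a (filename, fileContent) pair, so one file = one record.
val files = spark.wholeTextFiles("hdfs://myserver:8020/mydir")
val lineCounts = files.mapValues(_.split("\n").length)
lineCounts.collect().foreach(println)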
On 2014-1-31 at 12:18 AM, Philip Ogren philip.og...@oracle.com wrote:
In my Spark programming thus far my unit of work has been
a single row from an HDFS file by creating an
RDD
I have a few questions about yarn-standalone and yarn-client deployment
modes that are described on the Launching Spark on YARN
http://spark.incubator.apache.org/docs/latest/running-on-yarn.html page.
1) Can someone give me a basic conceptual overview? I am struggling
with understanding the
Great question! I was writing up a similar question this morning and
decided to investigate some more before sending. Here's what I'm
trying. I have created a new Scala project that contains only
spark-examples-assembly-0.8.1-incubating.jar and
My problem seems to be related to this:
https://issues.apache.org/jira/browse/MAPREDUCE-4052
So, I will try running my setup from a Linux client and see if I have
better luck.
On 1/15/2014 11:38 AM, Philip Ogren wrote:
Great question! I was writing up a similar question this morning
I have a very simple Spark application that looks like the following:
var myRdd: RDD[Array[String]] = initMyRdd()
println(myRdd.first.mkString(", "))
println(myRdd.count)
myRdd.saveAsTextFile("hdfs://myserver:8020/mydir")
myRdd.saveAsTextFile("target/mydir/")
The println statements work as
this on a multi-machine
cluster though -- you may get a bit of data on each machine in that
local directory.
On Thu, Jan 2, 2014 at 12:22 PM, Philip Ogren philip.og...@oracle.com wrote:
I have a very simple Spark application that looks like the following:
var myRdd
?
On Thu, Jan 2, 2014 at 12:54 PM, Philip Ogren philip.og...@oracle.com wrote:
I just tried your suggestion and get the same results with the
_temporary directory. Thanks though.
On 1/2/2014 10:28 AM, Andrew Ash wrote:
You want to write
, you can use the NLineInputFormat, I guess, which is
provided by Hadoop, and pass it as a parameter.
Maybe there are better ways to do it.
Regards,
Suman Bharadwaj S
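To make the suggestion concrete, a hedged sketch using the old (mapred) Hadoop API; the config key below is the mapred-era one, so adjust for your Hadoop version:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.mapred.lib.NLineInputFormat

val conf = new JobConf(spark.hadoopConfiguration)
FileInputFormat.setInputPaths(conf, "hdfs://myserver:8020/mydir")
// Each input split (and thus each task) receives this many lines.
conf.setInt("mapred.line.input.format.linespermap", 10000)
val lines = spark
  .hadoopRDD(conf, classOf[NLineInputFormat], classOf[LongWritable], classOf[Text])
  .map(_._2.toString)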
On Wed, Dec 25, 2013 at 1:57 AM, Philip Ogren philip.og...@oracle.com wrote
name? When I use
Spark it writes to HDFS as the user that runs the Spark services... I
wish it read and wrote as me.
On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren philip.og...@oracle.com wrote:
When I call rdd.saveAsTextFile("hdfs://...") it uses my username
Hi Spark Community,
I would like to expose my spark application/libraries via a web service
in order to launch jobs, interact with users, etc. I'm sure there are
hundreds of ways to think about doing this, each with a variety of
technology stacks that could be applied. So, I know there is no
When I call rdd.saveAsTextFile("hdfs://...") it uses my username to
write to HDFS. If I try to write to an HDFS directory that I
do not have permissions to, then I get an error like this:
Permission denied: user=me, access=WRITE,
inode="/user/you":you:us:drwxr-xr-x
I can obviously
You might try a more standard Windows path. I typically write to a
local directory such as target/spark-output.
On 12/11/2013 10:45 AM, Nathan Kronenfeld wrote:
We are trying to test out running Spark 0.8.0 on a Windows box, and
while we can get it to run all the examples that don't output
On Fri, Dec 6, 2013 at 7:06 PM, Philip Ogren philip.og...@oracle.com wrote:
I have a simple scenario that I'm struggling to implement. I
would like to take a fairly simple RDD generated from a large log
file, perform some
I have a simple scenario that I'm struggling to implement. I would like
to take a fairly simple RDD generated from a large log file, perform
some transformations on it, and write the results out such that I can
perform a Hive query either from Hive (via Hue) or Shark. I'm having
troubles
Here's a good place to start:
http://mail-archives.apache.org/mod_mbox/incubator-spark-user/201311.mbox/%3ccacyzca3askwd-tujhqi1805bn7sctguaoruhd5xtxcsul1a...@mail.gmail.com%3E
On 12/5/2013 10:18 AM, Benjamin Kim wrote:
Does anyone have an example or some sort of starting point code when
Hao,
If you have worked out the code and turned it into an example that you can
share, then please do! This task is in my queue of things to do so any
helpful details that you uncovered would be most appreciated.
Thanks,
Philip
On 11/13/2013 5:30 AM, Hao REN wrote:
Ok, I worked it out.
Hi Spark community,
I learned a lot the last time I posted some elementary Spark code here.
So, I thought I would do it again. Someone politely tell me offline if
this is noise or unfair use of the list! I acknowledge that this
borders on asking Scala 101 questions
I have an
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up
how many times each column is populated in an RDD. Here is the code I
came up with:
//RDD of List[String] corresponding to tab-delimited values
val columns = spark.textFile("myfile.tsv").map(line
can
collect at the end.
- Patrick
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote:
Hi Spark coders,
I wrote my first little Spark job that takes columnar data and counts up how
many times each column is populated in an RDD. Here is the code I came up
with:
//RDD
an ID for the column (maybe its index) and a flag for
whether it's present.
Then you reduce by key to get the per-column count. Then you can
collect at the end.
- Patrick
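Putting that suggestion into code, a sketch (treating a column as populated when its value is non-empty, which is an assumption):

import org.apache.spark.SparkContext._

// Emit (columnIndex, 1) for every populated column of every row, reduce by
// key to get per-column counts, and collect the small result at the driver.
val columns = spark.textFile("myfile.tsv").map(_.split("\t"))
val counts = columns
  .flatMap(_.zipWithIndex.collect { case (value, i) if value.nonEmpty => (i, 1) })
  .reduceByKey(_ + _)
  .collect()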
On Fri, Nov 8, 2013 at 1:15 PM, Philip Ogren philip.og...@oracle.com
wrote:
Hi Spark coders,
I wrote my first little Spark
On the front page http://spark.incubator.apache.org/ of the Spark
website there is the following simple word count implementation:
file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
The same code can be found in the Quick Start
for third-party
apps.
Matei
On Nov 7, 2013, at 1:15 PM, Philip Ogren philip.og...@oracle.com wrote:
I remember running into something very similar when trying to perform
a foreach on java.util.List and I fixed it by adding the following
import:
import
My team is investigating a number of technologies in the Big Data
space. A team member recently got turned on to Cascading
http://www.cascading.org/about-cascading/ as an application layer for
orchestrating complex workflows/scenarios. He asked me if Spark had an
application layer. My
Hi Arun,
I had recent success getting a Spark project set up in Eclipse Juno.
Here are the notes that I wrote down for the rest of my team that you
may perhaps find useful:
Spark version 0.8.0 requires Scala version 2.9.3. This is a bit
inconvenient because Scala is now on version 2.10.3
. if the pipeline object is null.) This seems reasonable to
me. I will try it on an actual cluster next
Thanks,
Philip
On 10/22/2013 11:50 AM, Philip Ogren wrote:
I have a text analytics pipeline that performs a sequence of steps
(e.g. tokenization, part-of-speech tagging, etc
[
https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570302#comment-13570302
]
Philip Ogren commented on DERBY-4921:
-------------------------------------
If you change the client driver to behave
[
https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569193#comment-13569193
]
Philip Ogren commented on DERBY-4921:
-------------------------------------
I would like to challenge the decision to close
[
https://issues.apache.org/jira/browse/DERBY-4921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569197#comment-13569197
]
Philip Ogren commented on DERBY-4921:
-------------------------------------
I prepared some code that demonstrates
[
https://issues.apache.org/jira/browse/UIMA-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12977607#action_12977607
]
Philip Ogren commented on UIMA-1983:
If I define a type called MyAnnotation in a type
Components: Core Java Framework
Reporter: Philip Ogren
Priority: Minor
When the compiler warnings are set to complain when name shadowing or name
conflicts exist, then the source files produced by JCasGen contain many
warnings. It sure would be nice if these files came out
I am wondering if it is possible to run the CAS Editor as a stand-alone
application or if it is only available as a plugin within Eclipse.
Thanks,
Philip
[
https://issues.apache.org/jira/browse/UIMA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936054#action_12936054
]
Philip Ogren commented on UIMA-1875:
This looks really promising - thanks
Borobudur,
Do you mean training in the machine learning sense? If so, UIMA does
not directly support any notion of training and using statistical
classifiers. You might check out ClearTK
http://cleartk.googlecode.com which is a UIMA-based project that
provides support for a number of
...@uima.apache.org] On behalf of Philip Ogren
Sent: Monday, November 1, 2010 16:20
To: user@uima.apache.org
Subject: Re: Compare two CASes
Hi Armin,
I have put together some example code using uimaFIT to address this
very common scenario. There's a wiki page that provides an entry
Hi Armin,
I have put together some example code using uimaFIT to address
this very common scenario. There's a wiki page that provides an entry
point here:
http://code.google.com/p/uimafit/wiki/RunningExperiments
Hope this helps.
Philip
On 11/1/2010 2:09 AM,
Marshall,
Thank you for this helpful hint. I was just now following up on
Richard's tips and was trying to figure out how to get Eclipse to
recognize the new source directory. I think I now have a working solution!
I will send it around shortly.
Philip
On 10/21/2010 9:03 AM, Marshall
We are using maven as the build platform for our uima-based projects.
As part of the build, org.apache.uima.tools.jcasgen.Jg is invoked to
generate Java files for the type system. This is performed in the
process-resources phase using a plugin configuration previously
discussed on this list.
Research at the University of
Colorado at Boulder, and the Ubiquitous Knowledge Processing (UKP) Lab
at the Technische Universität Darmstadt. uimaFIT is extensively used by
projects being developed by these groups.
The uimaFIT development team is:
Philip Ogren, University of Colorado, USA
I am involved in a discussion about how to do logging in UIMA. We were
looking at section 1.2.2 in the tutorial for some motivation as to why
we would use the built-in UIMA logging rather than just using e.g. log4j
directly - but there doesn't seem to be any. Could someone give us some
I have a component that takes text from the _InitialView view, creates
a second view, and posts a modified version of the text to the second
view. I had a unit test that was reading in a JCas from an XMI file and
running the JCas through my annotator and testing that the text was
correctly
Ram,
You might check out the uutuc project at:
http://code.google.com/p/uutuc/
The main goal of this project is to make it easier to dynamically
describe and instantiate UIMA components. The project started off as
utility classes for unit testing - but has really become a dynamic
Girish,
I have done exactly the same thing as you, minus step 2 below, without any
problems. The only caveat is that this didn't seem (as I recall) to
trigger my analysis engine's initialize() method, and so I had to reread
the parameter in my analysis engine's process() method.
I don't
One thing that you might consider doing is putting the path information
into its own view. That is, create a new view and set its document path
to be the path/uri. One advantage of this is that if you have a
CollectionReader that is otherwise type system agnostic you don't have
to pollute it
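A rough sketch of the idea (the view name and MIME type are illustrative, written from memory rather than taken from the thread):

import org.apache.uima.jcas.JCas

// Store the document path in a dedicated view; the type system stays clean.
def setDocumentUri(jCas: JCas, uri: String) {
  val uriView = jCas.createView("UriView")
  uriView.setSofaDataURI(uri, "text/plain")
}

// Any downstream component can recover it from the same view.
def getDocumentUri(jCas: JCas): String =
  jCas.getView("UriView").getSofa().getSofaURI()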
It may be worth pointing out that there is a very nice set of UIMA
wrappers for OpenNLP available from their SourceForge CVS repository.
See http://opennlp.cvs.sourceforge.net/opennlp/. While this is still a
work in progress - it is *much* nicer than the example wrappers that
ship with UIMA.
We have posted a light-weight set of utility classes that ease the
burden of unit testing UIMA components. The project is located at:
http://uutuc.googlecode.com/
and is licensed under ASL 2.0.
There is very little documentation for this library at the moment - just
a bare-bones getting
We have assembled some miscellaneous utility methods that make unit
testing easier in support of our UIMA-based project, ClearTK. I have come across
several scenarios now where I wish that this code was available as a
separate project so that I don't have to create a dependency on our
entire ClearTK
Is it possible to update the payloads of an existing index? I'm having
trouble finding any mention of this in the mailing list archives, and it
is not obvious from the API that this is possible. I do not want to
change the size of the payloads - just update the values. My payload
values
Marshall,
This may be frustrating/annoying feedback. Last summer I sent a few
emails to the list about unit testing UIMA components using Groovy and
probably said a few other positive things about Groovy. We have since
abandoned Groovy for a variety of reasons. Here are a few:
- unit
Katrin,
Yes. There is a penalty for iterating through all the annotations of a
given type. Imagine you have a token annotation and a document with 10K
tokens (not uncommon).
We wrote a method that doesn't have this performance penalty and
bypasses the type priorities.
Please see:
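For the general shape of such a method, a sketch using uimaFIT's JCasUtil.selectCovered, a similar facility (not necessarily the method referred to above):

import org.apache.uima.fit.util.JCasUtil
import org.apache.uima.jcas.JCas
import org.apache.uima.jcas.tcas.Annotation

// Annotations of type `cls` covered by `window`, found via an index lookup
// rather than a scan over every annotation of that type in the CAS.
def covered[T <: Annotation](jCas: JCas, cls: Class[T], window: Annotation): java.util.List[T] =
  JCasUtil.selectCovered(jCas, cls, window)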
I didn't follow the thread closely so I may be wandering here - but I
thought I would volunteer my working strategy for testing collection
readers in Groovy even though it may be overly simplistic for many
situations.
My unit tests for our collection readers start off with one line:
JCas
That's a bug. The underlying implementation of the two
iterator types you mention is totally different, hence
you see this only in one of them. Any chance you could
provide a self-contained test case that exhibits this?
--Thilo
Philip Ogren wrote:
I am having difficulty with using
that,
please check the box that says something like "ok to include in
Apache code" (so we can check it in and use it as a regression test).
Groovy, hm. Never used it before. If it doesn't take me more than
5 min to set up in Eclipse, and I can still debug, not a problem ;-)
--Thilo
Philip Ogren wrote
Thilo had pointed me towards the method typeSystemInit() in a recent
posting as a way of getting type system information in an annotator. Is
there a reason that this method exists in CasAnnotator_ImplBase but not
JCasAnnotator_ImplBase? Or is this an omission? My intuition is that
might
[
https://issues.apache.org/jira/browse/UIMA-464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Philip Ogren updated UIMA-464:
--
Attachment: UIMA-464.zip
Please see README in the top level of the directory.
If there is something I
Attached is a type system descriptor file that isolates the bug. I
cannot create a subtype of TestType in the CDE.
Philip Ogren wrote:
I was just putting some unit tests together and was editing a type
system and noticed that I can't seem to subtype a type that is a
subtype
Sorry for the noise! A little investigation reveals that this behavior
is almost certainly by design. Changing the source by hand gives an
error message that says "don't do that", and section 2.3.4 of the
UIMA References also documents this.
Philip Ogren wrote:
Attached is a type system
to the CasEditor to
uima-user so we don't all have to go hunting for it?
Thanks,
Andrew Borthwick
On 6/7/07, Philip Ogren [EMAIL PROTECTED] wrote:
Also note that we have a contribution by Joern Kottmann in the
sandbox called
CAS editor. This is Eclipse based tooling that also allows you
to manually
I have three related questions that I decided to split up into three
messages. I composed them as one email initially and decided I could be
spawning a hard-to-traverse thread. Advance apologies for the inundation.
I am trying to create an Index Collection Descriptor File so that I
can
Is there any simple way to ask for the token 3 positions to the left of my
current token? I can't find anything that is built into the default
annotation index, and so I have defined an index for this in the
descriptor file. In order to do this I define a feature in my token
type that keeps track of
My initial thought was to have a CasConsumer that loads annotations
directly into Knowtator programmatically, and a CasInitializer that goes
the other way. What remains is to have a way to translate/synchronize
the Type System in UIMA with the class hierarchy / annotation schema in
Knowtator
I'm glad I happened to browse the archive today! I just joined the list
today because I have noticed a couple of bugs that I want to post
somewhere. So, I developed and maintain Knowtator and am also steeped in
UIMA technology - I have been using it for just over a year and a half
now. I would
If you create a type with the name 'Feature' you get compile errors
because of a namespace conflict with the Feature interface. I think
this could be easily fixed by simply removing the import statement in
the generated code and explicitly providing the fully qualified name for
the Feature