Exception{
LineNumberReader rdr = new LineNumberReader(new FileReader(f));
StringBuilder sb = new StringBuilder();
String line = rdr.readLine();
while(line != null) {
sb.append(line);
sb.append("\n");
line = rdr.readLine();
are within a map step.
>
> Generally you should not call external applications from Spark.
>
> > Am 11.11.2018 um 23:13 schrieb Steve Lewis :
> >
> > I have a problem where a critical step needs to be performed by a third
> party c++ application. I can send or install
I have a problem where a critical step needs to be performed by a third
party c++ application. I can send or install this program on the worker
nodes. I can construct a function holding all the data this program needs
to process. The problem is that the program is designed to read and write
from
We are trying to run a job that has previously run on Spark 1.3 on a
different cluster. The job was converted to 2.3 spark and this is a
new cluster.
The job dies after completing about a half dozen stages with
java.io.IOException: No space left on device
It appears that the nodes are
Ok I am stymied. I have tried everything I can think of to get spark to use
my own version of
log4j.properties
In the launcher code - I launch a local instance from a Java application
I say -Dlog4j.configuration=conf/log4j.properties
where conf/log4j.properties is user.dir - no luck
Spark
I asked a similar question a day or so ago but this is a much more concrete
example showing the difficulty I am running into
I am trying to use DataSets. I have an object which I want to encode with
its fields as columns. The object is a well behaved Java Bean.
However one field is an object (or
I have a relatively complex Java object that I would like to use in a
dataset
if I say
Encoder evidence = Encoders.kryo(MyType.class);
JavaRDD rddMyType= generateRDD(); // some code
Dataset datasetMyType= sqlCtx.createDataset( rddMyType.rdd(),
evidence);
I get one column - the whole object
assume I have the following code
SparkConf sparkConf = new SparkConf();
JavaSparkContext sqlCtx= new JavaSparkContext(sparkConf);
JavaRDD rddMyType= generateRDD(); // some code
Encoder evidence = Encoders.kryo(MyType.class);
Dataset datasetMyType= sqlCtx.createDataset( rddMyType.rdd(),
columns. Try running: datasetMyType.printSchema()
>
> On Mon, Jan 25, 2016 at 1:16 PM, Steve Lewis <lordjoe2...@gmail.com>
> wrote:
>
>> assume I have the following code
>>
>> SparkConf sparkConf = new SparkConf();
>>
>> JavaSparkContext sqlCtx= new JavaSp
ou can also do this using lambda functions if you want though:
>
> ds1.groupBy(_.region).cogroup(ds2.groupBy(_.region) { (key, iter1, iter2)
> =>
> ...
> }
>
>
> On Wed, Jan 20, 2016 at 10:26 AM, Steve Lewis <lordjoe2...@gmail.com>
> wrote:
>
>> We
We have been working a large search problem which we have been solving in
the following ways.
We have two sets of objects, say children and schools. The object is to
find the closest school to each child. There is a distance measure but it
is relatively expensive and would be very costly to apply
I am running on a spark 1.5.1 cluster managed by Mesos - I have an
application that handled a chemistry problem which can be increased by
increasing the number of atoms - increasing the number of Spark stages. I
do a repartition at each stage - Stage 9 is the last stage. At each stage
the size and
I have been using my own code to build the jar file I use for spark
submit. In 1.4 I could simply add all class and resource files I find in
the class path to the jar and add all jars in the classpath into a
directory called lib in the jar file.
In 1.5 I see that resources and classes in jars in
I was in a discussion with someone who works for a cloud provider which
offers Spark/Hadoop services. We got into a discussion of performance and
the bewildering array of machine types and the problem of selecting a
cluster with 20 Large instances VS 10 Jumbo instances or the trade offs
between
once more. Tasks typically
preform both operations several hundred thousand times. why it can not be
done distributed way?
On Thu, May 7, 2015 at 3:16 PM, Steve Lewis lordjoe2...@gmail.com wrote:
I am performing a job where I perform a number of steps in succession.
One step is a map
I am performing a job where I perform a number of steps in succession.
One step is a map on a JavaRDD which generates objects taking up
significant memory.
The this is followed by a join and an aggregateByKey.
The problem is that the system is running getting OutOfMemoryErrors -
Most tasks work
null
Sent from Samsung Mobile
Original message
From: Olivier Girardot
Date:2015/04/18 22:04 (GMT+00:00)
To: Steve Lewis ,user@spark.apache.org
Subject: Re: Can a map function return null
You can return an RDD with null values inside, and afterwards filter on
item
I find a number of cases where I have an JavaRDD and I wish to transform
the data and depending on a test return 0 or one item (don't suggest a
filter - the real case is more complex). So I currently do something like
the following - perform a flatmap returning a list with 0 or 1 entry
depending
-- Forwarded message --
From: Steve Lewis lordjoe2...@gmail.com
Date: Wed, Mar 11, 2015 at 9:13 AM
Subject: Re: Numbering RDD members Sequentially
To: Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com
perfect - exactly what I was looking for, not quite sure why it is
called
I have Hadoop Input Format which reads records and produces
JavaPairRDDString,String locatedData where
_1() is a formatted version of the file location - like
12690,, 24386 .27523 ...
_2() is data to be processed
For historical reasons I want to convert _1() into in integer
I have an application where a function needs access to the results of a
select from a parquet database. Creating a JavaSQLContext and from it
a JavaSchemaRDD
as shown below works but the parallelism is not needed - a simple JDBC call
would work -
Are there alternative non-parallel ways to
I notice new methods such as JavaSparkContext makeRDD (with few useful
examples) - It takes a Seq but while there are ways to turn a list into a
Seq I see nothing that uses an Iterable
I am aware of the ADAM project in Berkeley and I am working on Proteomic
searches -
anyone else working in this space
I have an RDD which is potentially too large to store in memory with
collect. I want a single task to write the contents as a file to hdfs. Time
is not a large issue but memory is.
I say the following converting my RDD (scans) to a local Iterator. This
works but hasNext shows up as a separate task
output file.
- SF
On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis lordjoe2...@gmail.com
wrote:
I have an RDD which is potentially too large to store in memory with
collect. I want a single task to write the contents as a file to hdfs. Time
is not a large issue but memory is.
I say the following
temp dirs if it has to.
On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
The objective is to let the Spark application generate a file in a format
which can be consumed by other programs - as I said I am willing to give up
parallelism at this stage (all the expensive
thread.
On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis lordjoe2...@gmail.com wrote:
I have a function which generates a Java object and I want to explore
failures which only happen when processing large numbers of these object.
the real code is reading a many gigabyte file but in the test code I
assume I don't care about values which may be created in a later map - in
scala I can say
val rdd = sc.parallelize(1 to 10, numSlices = 1000)
but in Java JavaSparkContext can only paralellize a List - limited to
Integer,MAX_VALUE elements and required to exist in memory - the best I can
I am using a custom hadoop input format which works well on smaller files
but fails with a file at about 4GB size - the format is generating about
800 splits and all variables in my code are longs -
Any suggestions? Is anyone reading files of this size?
Exception in thread main
I am trying to look at problems reading a data file over 4G. In my testing
I am trying to create such a file.
My plan is to create a fasta file (a simple format used in biology)
looking like
1
TCCTTACGGAGTTCGGGTGTTTATCTTACTTATCGCGGTTCGCTGCCGCTCCGGGAGCCCGGATAGGCTGCGTTAATACCTAAGGAGCGCGTATTG
2
I am running a large job using 4000 partitions - after running for four
hours on a 16 node cluster it fails with the following message.
The errors are in spark code and seem address unreliability at the level of
the disk -
Anyone seen this and know what is going on and how to fix it.
Exception
. In particular, if the RDDs are PairRDDs,
partitions are assigned based on the hash of the key, so an even
distribution of values among keys is required for even split of data across
partitions.
On December 2, 2014 at 4:15:25 PM, Steve Lewis (lordjoe2...@gmail.com)
wrote:
1) I can go
https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/TaskContext.java
has a Java implementation if TaskContext wit a very useful method
/** * Return the currently active TaskContext. This can be called inside of
* user functions to access contextual information about
I have been working on balancing work across a number of partitions and
find it would be useful to access information about the current execution
environment much of which (like Executor ID) are available if there was a
way to get the current executor or the Hadoop TaskAttempt context -
does any
period, you can
identify 'hot spots' or expensive sections in the user code.
On Tue, Dec 2, 2014 at 3:03 PM, Steve Lewis lordjoe2...@gmail.com wrote:
I am working on a problem which will eventually involve many millions of
function calls. A have a small sample with several thousand calls working
I am running on a 15 node cluster and am trying to set partitioning to
balance the work across all nodes. I am using an Accumulator to track work
by Mac Address but would prefer to use data known to the Spark environment
- Executor ID, and Function ID show up in the Spark UI and Task ID and
I am running on a 15 node cluster and am trying to set partitioning to
balance the work across all nodes. I am using an Accumulator to track work
by Mac Address but would prefer to use data known to the Spark environment
- Executor ID, and Function ID show up in the Spark UI and Task ID and
I have an JavaPairRDDKeyType,Tuple2Type1,Type2 originalPairs. There are
on the order of 100 million elements
I call a function to rearrange the tuples
JavaPairRDDString,Tuple2Type1,Type2 newPairs =
originalPairs.values().mapToPair(new PairFunctionTuple2Type1,Type2,
String, Tuple2IType1,Type2
. An alternative is to make the .equals() and .hashcode() of
KeyType delegate to the .getId() method you use in the anonymous function.
Cheers,
Andrew
On Tue, Nov 25, 2014 at 10:06 AM, Steve Lewis lordjoe2...@gmail.com
wrote:
I have an JavaPairRDDKeyType,Tuple2Type1,Type2 originalPairs
The spark UI lists a number of Executor IDS on the cluster. I would like
to access both executor ID and Task/Attempt IDs from the code inside a
function running on a slave machine.
Currently my motivation is to examine parallelism and locality but in
Hadoop this aids in allowing code to write
I have instrumented word count to track how many machines the code runs
on. I use an accumulator to maintain a Set or MacAddresses. I find that
everything is done on a single machine. This is probably optimal for word
count but not the larger problems I am working on.
How to a force processing to
to force it to have more
partitions, you can call RDD.repartition(numPartitions). Note that this
will introduce a shuffle you wouldn't otherwise have.
Also make sure your job is allocated more than one core in your cluster
(you can see this on the web UI).
On Fri, Nov 14, 2014 at 2:18 PM, Steve
I am trying to determine how effective partitioning is at parallelizing my
tasks. So far I suspect it that all work is done in one task. My plan is to
create a number of accumulators - one for each task and have functions
increment the accumulator for the appropriate task (or slave) the values
JavaSparkContext currentContext = ...;
AccumulatorInteger accumulator = currentContext.accumulator(0,
MyAccumulator);
will create an Accumulator of Integers. For many large Data problems
Integer is too small and Long is a better type.
I see a call like the following
JavaSparkContext has helper methods for int and double but not long. You
can just make your own little implementation of AccumulatorParamLong
right? ... which would be nice to add to JavaSparkContext.
On Wed, Nov 12, 2014 at 11:05 PM, Steve Lewis lordjoe2...@gmail.com
wrote
In my problem I have a number of intermediate JavaRDDs and would like to
be able to look at their sizes without destroying the RDD for sibsequent
processing. persist will do this but these are big and perisist seems
expensive and I am unsure of which StorageLevel is needed, Is there a way
to
I see the job in the web interface but don't know how to kill it
,key join with JavaPairRDDbin,lock
If you partition both RDDs by the bin id, I think you should be able to
get what you want.
Best Regards,
Sonal
Nube Technologies http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal
On Fri, Oct 31, 2014 at 5:44 PM, Steve Lewis lordjoe2
The original problem is in biology but the following captures the CS
issues, Assume I have a large number of locks and a large number of keys.
There is a scoring function between keys and locks and a key that fits a
lock will have a high score. There may be many keys fitting one lock and a
key
Assume in my executor I say
SparkConf sparkConf = new SparkConf();
sparkConf.set(spark.kryo.registrator,
com.lordjoe.distributed.hydra.HydraKryoSerializer);
sparkConf.set(mysparc.data, Some user Data);
sparkConf.setAppName(Some App);
Now
1) Are there default
A cluster I am running on keeps getting KryoException. Unlike the Java
serializer the Kryo Exception gives no clue as to what class is giving the
error
The application runs properly locally but no the cluster and I have my own
custom KryoRegistrator and register sereral dozen classes -
objects you're trying to serialize and see if
those work.
-Original Message-
*From: *Steve Lewis [lordjoe2...@gmail.com]
*Sent: *Tuesday, October 28, 2014 10:46 PM Eastern Standard Time
*To: *user@spark.apache.org
*Subject: *com.esotericsoftware.kryo.KryoException: Encountered
Collect will store the entire output in a List in memory. This solution is
acceptable for Little Data problems although if the entire problem fits
in the memory of a single machine there is less motivation to use Spark.
Most problems which benefit from Spark are large enough that even the data
At the end of a set of computation I have a JavaRDDString . I want a
single file where each string is printed in order. The data is small enough
that it is acceptable to handle the printout on a single processor. It may
be large enough that using collect to generate a list might be unacceptable.
expose a method to iterate over the data,
called toLocalIterator. It does not require that the RDD fit entirely
in memory.
On Mon, Oct 20, 2014 at 6:13 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
At the end of a set of computation I have a JavaRDDString . I want a
single file where each
I am running a couple of functions on an RDD which require access to data
on the file system known to the context. If I create a class with a context
a a member variable I get a serialization error,
So I am running my function on some slave and I want to read in data from a
Path defined by a
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2 -
I repeatedly see
the following in my logs.
I believe this happens in combineByKey
14/10/08 09:36:30 INFO executor.Executor: Running task 3.0 in stage 0.0
(TID 3)
14/10/08 09:36:30 INFO broadcast.TorrentBroadcast:
spark.broadcast.factory
to org.apache.spark.broadcast.HttpBroadcastFactory in spark conf.
Thanks,
Liquan
On Wed, Oct 8, 2014 at 11:21 AM, Steve Lewis lordjoe2...@gmail.com
wrote:
I am running on Windows 8 using Spark 1.1.0 in local mode with Hadoop 2.2
- I repeatedly see
the following in my
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
Try a Hadoop Custom InputFormat - I can give you some samples -
While I have not tried this an input split has only a length (could be
ignores if the format treats as non splittable) and a String for a location.
If the location is a URL into wikipedia the whole thing should work.
Hadoop
I number of the problems I want to work with generate datasets which are
too large to hold in memory. This becomes an issue when building a
FlatMapFunction and also when the data used in combineByKey cannot be held
in memory.
The following is a simple, if a little silly, example of a
This sample below is essentially word count modified to be big data by
turning lines into groups of
upper case letters and then generating all case variants - it is modeled
after some real problems in biology
The issue is I know how to do this in Hadoop but in Spark the use of a List
in my flatmap
,
Liquan
On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
Well I had one and tried that - my message tells what I found found
1) Spark only accepts org.apache.hadoop.mapred.InputFormatK,V
not org.apache.hadoop.mapreduce.InputFormatK,V
2) Hadoop expects K and V
Do your custom Writable classes implement Serializable - I think that is
the only real issue - my code uses vanilla Text
Hmmm - I have only tested in local mode but I got an
java.io.NotSerializableException: org.apache.hadoop.io.Text
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1180)
at
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1528)
at
java.lang.NullPointerException
at java.nio.ByteBuffer.wrap(ByteBuffer.java:392)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
When I experimented with using an InputFormat I had used in Hadoop for a
long time in Hadoop I found
1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
class not org.apache.hadoop.mapreduce.lib.input;FileInputFormat
2) initialize needs to be called in the constructor
3)
The only way I find is to turn it into a list - in effect holding
everything in memory (see code below). Surely Spark has a better way.
Also what about unterminated iterables like a Fibonacci series - (useful
only if limited in some other way )
/**
* make an RDD from an iterable
*
, reduce,
merge)
reducedUnsorted.sortByKey()
On Fri, Sep 19, 2014 at 1:37 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
I am struggling to reproduce the functionality of a Hadoop reducer on
Spark (in Java)
in Hadoop I have a function
public void doReduce(K key, IteratorV values)
in Hadoop
Assume I have a large book with many Chapters and many lines of text.
Assume I have a function that tells me the similarity of two lines of
text. The objective is to find the most similar line in the same chapter
within 200 lines of the line found.
The real problem involves biology and is beyond
In modern projects there are a bazillion dependencies - when I use Hadoop I
just put them in a lib directory in the jar - If I have a project that
depends on 50 jars I need a way to deliver them to Spark - maybe wordcount
can be written without dependencies but real projects need to deliver
In a Hadoop jar there is a directory called lib and all non-provided third
party jars go there and are included in the class path of the code. Do jars
for Spark have the same structure - another way to ask the question is if I
have code to execute Spark and a jar build for Hadoop can I simply use
worrying about this.
Matei
On August 30, 2014 at 9:04:37 AM, Steve Lewis (lordjoe2...@gmail.com)
wrote:
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine
Assume say JavaWord count
I call the equivalent of a Mapper
JavaPairRDDString, Integer ones = words.mapToPair(,,,
Now right here I want to guarantee that each word starting with a
particular letter is processed in a specific partition - (Don't tell me
this is a dumb idea - I know that but in a
Assume say JavaWord count
I call the equivalent of a Mapper
JavaPairRDDString, Integer ones = words.mapToPair(,,,
Now right here I want to guarantee that each word starting with a
particular letter is processed in a specific partition - (Don't tell me
this is a dumb idea - I know that but in a
, Steve Lewis (lordjoe2...@gmail.com)
wrote:
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine (thread) will be received in
sorted order
3) These conditions
When programming in Hadoop it is possible to guarantee
1) All keys sent to a specific partition will be handled by the same
machine (thread)
2) All keys received by a specific machine (thread) will be received in
sorted order
3) These conditions will hold even if the values associated with a
In many cases when I work with Map Reduce my mapper or my reducer might
take a single value and map it to multiple keys -
The reducer might also take a single key and emit multiple values
I don't think that functions like flatMap and reduceByKey will work or are
there tricks I am not aware of
I was able to get JavaWordCount running with a local instance under
IntelliJ.
In order to do so I needed to use maven to package my code and
call
String[] jars = {
/SparkExamples/target/word-count-examples_2.10-1.0.0.jar };
sparkConf.setJars(jars);
After that the sample ran properly and
invoke an action, like
count(), on the words RDD.
On Mon, Aug 25, 2014 at 6:32 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
I was able to get JavaWordCount running with a local instance under
IntelliJ.
In order to do so I needed to use maven to package my code and
call
String[] jars
I download the binaries for spark-1.0.2-hadoop1 and unpack it on my Widows
8 box.
I can execute spark-shell.com and get a command window which does the
proper things
I open a browser to http:/localhost:4040 and a window comes up describing
the spark-master
Then using IntelliJ I create a project
notes -
http://ml-nlp-ir.blogspot.com/2014/04/building-spark-on-windows-and-cloudera.html
on how to build from source and run examples in spark shell.
Regards,
Manu
On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
I want to look at porting a Hadoop problem
-spark-on-windows-and-cloudera.html
on how to build from source and run examples in spark shell.
Regards,
Manu
On Sat, Aug 16, 2014 at 12:14 PM, Steve Lewis lordjoe2...@gmail.com
wrote:
I want to look at porting a Hadoop problem to Spark - eventually I want
to run on a Hadoop 2.0 cluster
I want to look at porting a Hadoop problem to Spark - eventually I want to
run on a Hadoop 2.0 cluster but while I am learning and porting I want to
run small problems in my windows box.
I installed scala and sbt.
I download Spark and in the spark directory can say
mvn -Phadoop-0.23
84 matches
Mail list logo