Mailed our list - will send it to Spark Dev
On Fri, Sep 5, 2014 at 11:28 AM, Rajat Gupta rgu...@qubole.com wrote:
+1 on this. First step to more automated autoscaling of spark application
master...
On Fri, Sep 5, 2014 at 12:56 AM, Praveen Seluka psel...@qubole.com
wrote:
+user
On
Hello All,
I am trying to query a Hive table using Spark SQL from my Java code, but
getting the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage
failure: Task not serializable: java.io.NotSerializableException:
Hi, thanks for the reply.
Actually, I think you are correct about the native library not being thread
safe. However, from what I understand, different cores should start different
processes, these being independent from one another. If I am not mistaken,
the thread-safety issue would start being
How about:
val range = Range.getRange.toString
val notWorking = "path/output_{" + range + "}/*/*"
On Fri, Sep 5, 2014 at 3:45 AM, jerryye jerr...@gmail.com wrote:
Hi,
I have a quick serialization issue. I'm trying to read a date range of input
files and I'm getting a serialization issue when using
Thanks for the response Sean.
As a correction. The code I provided actually ended up working. I tried to
reduce my code down but I was being overzealous and running count actually
works.
The minimal code that triggers the problem is this:
val userProfiles = lines.map(line =>
Hmmm, even I don't understand why TheMain needs to be serializable. It might
be cleaner (as in, avoiding such mysterious closure issues) to actually create
a separate sbt/maven project (instead of a shell) and run the streaming
application from there.
TD
On Thu, Sep 4, 2014 at 10:25 AM, kpeng1
Hi,
We are evaluating PySpark and successfully executed the PySpark examples on
YARN.
The next step we want to do:
We have a Python project (a bunch of Python scripts using Anaconda
packages).
Question:
What is the way to execute PySpark on YARN with a lot of Python
files (~50)?
Hi Dhimant,
We also cleaned up these needless warnings on port failover in Spark 1.1 --
see https://issues.apache.org/jira/browse/SPARK-1902
Andrew
On Thu, Sep 4, 2014 at 7:38 AM, Dhimant dhimant84.jays...@gmail.com wrote:
Thanks Yana,
I am able to execute application and command via
Hi,
FYI: There is a bug in Java mapPartitions - SPARK-3369
https://issues.apache.org/jira/browse/SPARK-3369. In Java, results from
mapPartitions and similar functions must fit in memory. Look at the example
below - it returns a List.
Lukas
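(For contrast, a minimal Scala sketch, not the Java example referred to above: the Scala API passes an Iterator in and out, so a partition's results need not be materialized as a List; sc is a live SparkContext.)

  // The partition stays a lazy Iterator; nothing is collected into a List.
  val doubled = sc.parallelize(1 to 1000000, 4).mapPartitions { iter =>
    iter.map(_ * 2)
  }
  doubled.count()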
On 1.9.2014 00:50, Matei Zaharia wrote:
mapPartitions just
Hi Andrew,
thank you very much for your answer. Unfortunately it still doesn't work.
I'm using Spark 1.0.0, and I start the history server by running
sbin/start-history-server.sh dir, although I also set
SPARK_HISTORY_OPTS=-Dspark.history.fs.logDirectory in conf/spark-env.sh. I
also tried other dirs
Hi,
Does Spark support recursive calls?
Hi,
var cachedPeers: Seq[BlockManagerId] = null
private def replicate(blockId: String, data: ByteBuffer, level: StorageLevel) {
  val tLevel = StorageLevel(level.useDisk, level.useMemory, level.deserialized, 1)
  if (cachedPeers == null) {
    cachedPeers = master.getPeers(blockManagerId,
Hi,
Due to some limitations, we are having to stick to Kafka 0.7.x. We would
like to use as recent a version of Spark as possible in streaming mode that
integrates with Kafka 0.7. The latest version appears to support only 0.8. Has
anyone solved such a requirement? Any tips on what can be tried?
FWIW, I
Thank you, Ankur! :)
But how do I assign the storage level to a new vertices RDD that is mapped from
an existing vertices RDD?
e.g.
val newVertexRDD = graph.collectNeighborIds(EdgeDirection.Out)
  .map { case (id: VertexId, a: Array[VertexId]) => (id, initialHashMap(a)) }
the new one will be combined with
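(A minimal sketch of one way to set the level on the mapped result, assuming it is then used as an ordinary RDD; MEMORY_AND_DISK is just an example level.)

  import org.apache.spark.storage.StorageLevel

  // persist() marks the desired storage level; call it before the RDD is reused.
  val withLevel = newVertexRDD.persist(StorageLevel.MEMORY_AND_DISK)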
Is this possible? If I have a cluster set up on EC2 and I want to run
spark-shell --master <my master IP on EC2>:7077 from my home computer -
is this possible at all or am I wasting my time ;)? I am seeing a
connection timeout when I try it.
Thanks!
Ognen
As far as I know, in order to deploy and execute jobs on EC2 you need to
assemble your project, copy your jar into the cluster, log in using ssh
and submit the job.
To avoid having to do this I've been prototyping an sbt plugin (1) that
allows you to create and send Spark jobs to an Amazon EC2 cluster
\o/ = will test it soon or sooner, gr8 idea btw
aℕdy ℙetrella
about.me/noootsab
On Fri, Sep 5, 2014 at 12:37 PM, Felix Garcia Borrego fborr...@gilt.com
wrote:
As far as I know, in order to deploy and execute jobs on EC2 you need to
I am building spark-1.0-rc4 with Maven, following
http://spark.apache.org/docs/latest/building-with-maven.html
But when running GraphX edgeListFile, some tasks failed:
error: requested array size exceeds VM limits
error: executor lost
Does anyone know how to fix it?
Hello,
Thanks for the explanation. So events are stored internally as JSON, but
there is no official support for having Spark serve that JSON via HTTP? So if I
wanted to write an app that monitors Spark, I would either have to scrape the
web UI in HTML or rely on unofficial JSON
Hi,
I'm trying to migrate a map-reduce program to work with spark. I migrated
the program from Java to Scala. The map-reduce program basically loads a
HDFS file and for each line in the file it applies several transformation
functions available in various external libraries.
When I execute this
You can bring those classes out of the library and make them Serializable (implements
Serializable). It is not the right way of doing it, though it solved a few of
my similar problems.
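(A rough sketch of the idea; NotSerializableEvaluator stands in for a third-party class, and a common variant is to wrap it and rebuild the instance lazily on each executor rather than copying the class source.)

  // Stand-in for a third-party class that does not implement Serializable.
  class NotSerializableEvaluator(expr: String) {
    def evaluate(input: String): String = expr + "(" + input + ")"
  }

  // Serializable wrapper: the heavy object is transient and rebuilt lazily
  // on each executor after deserialization.
  class EvaluatorWrapper(expr: String) extends Serializable {
    @transient private lazy val evaluator = new NotSerializableEvaluator(expr)
    def eval(input: String): String = evaluator.evaluate(input)
  }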
Thanks
Best Regards
On Fri, Sep 5, 2014 at 7:36 PM, Sarath Chandra
sarathchandra.jos...@algofusiontech.com wrote:
Hi,
Hi Akhil,
I've done this for the classes which are within my control. But what should I do
with classes that are outside my control?
For example org.apache.hadoop.io.Text
Also, I'm using several 3rd-party libraries like JEval.
~Sarath
On Fri, Sep 5, 2014 at 7:40 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Hi,
My version of Spark is 1.0.2.
I am trying to use the Spark Cassandra connector to execute an update CQL
statement inside a CassandraConnector(conf).withSessionDo block:
CassandraConnector(conf).withSessionDo { session =>
  myRdd.foreach {
    case (ip,
My approach may be partly influenced by my limited experience with SQL and
Hive, but I just converted all my dates to seconds-since-epoch and then
selected samples from specific time ranges using integer comparisons.
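(A minimal sketch of that approach; the record format and path are made up.)

  import java.text.SimpleDateFormat

  // Hypothetical line format: "2014-09-05,some payload"
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  val events = sc.textFile("hdfs:///events").map { line =>
    val Array(date, payload) = line.split(",", 2)
    (fmt.parse(date).getTime / 1000, payload)          // seconds since epoch
  }

  // Select a time range with plain integer comparisons.
  val start = fmt.parse("2014-09-01").getTime / 1000
  val end   = fmt.parse("2014-09-05").getTime / 1000
  val sample = events.filter { case (ts, _) => ts >= start && ts < end }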
On Thu, Sep 4, 2014 at 6:38 PM, Cheng, Hao hao.ch...@intel.com wrote:
There
Hi,
Whenever I replicate an RDD, I find that it gets replicated only on
one node. I have a 3-node cluster.
I set rdd.persist(StorageLevel.MEMORY_ONLY_2) in my application.
The web UI shows that it is replicated twice, but the RDD storage details
show that it is replicated only once and only in
Hi Oleg,
In order to simplify packaging and distributing your code, you could deploy
shared storage (such as NFS), put your project on it, and mount it
on all the slaves as /projects.
In the Spark job scripts, you can access your project by putting the path
into sys.path, such as:
After searching a little bit, I came to know that Spark 0.8 supports
Kafka 0.7. So, I tried to use it this way:
In my pom.xml, I specified a Spark dependency as follows:
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-streaming_2.9.3</artifactId>
After that long mail :-), I think I figured it out. I removed the
'provided' tag in my pom.xml and let the jars be directly included using
maven's jar-with-dependencies plugin. Things started working after that.
Thanks
Hemanth
On Fri, Sep 5, 2014 at 9:50 PM, Hemanth Yamijala yhema...@gmail.com
Err... there's no such feature?
Jianshi
On Wed, Sep 3, 2014 at 7:03 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Hi,
How can I list all registered tables in a sql context?
--
Jianshi Huang
LinkedIn: jianshi
Twitter: @jshuang
Github Blog: http://huangjs.github.com/
--
Preprocessing (after loading the data into HDFS).
I started with data in JSON format in text files (stored in HDFS), and then
loaded the data into parquet files with a bit of preprocessing and now I
always retrieve the data by creating a SchemaRDD from the parquet file and
using the SchemaRDD to
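(A minimal sketch of that pipeline with the Spark 1.1-era SQL API; paths are placeholders.)

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)

  // One-time preprocessing: load the raw JSON text files and store them as Parquet.
  val raw = sqlContext.jsonFile("hdfs:///data/raw-json")
  raw.saveAsParquetFile("hdfs:///data/events.parquet")

  // From then on, read the Parquet file back as a SchemaRDD and query it.
  val events = sqlContext.parquetFile("hdfs:///data/events.parquet")
  events.registerTempTable("events")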
Hi:
Curious... is there any reason not to use one of the below pyspark options
(in red)? Assuming each file is, say 10k in size, is 50 files too much?
Does that touch on some practical limitation?
Usage: ./bin/pyspark [options]
Options:
--master MASTER_URL spark://host:port,
Which version are you using -- I can reproduce your issue w/ 0.9.2 but not
with 1.0.1...so my guess is that it's a bug and the fix hasn't been
backported... No idea on a workaround though..
On Fri, Sep 5, 2014 at 7:58 AM, Dhimant dhimant84.jays...@gmail.com wrote:
Hi,
I am getting type
Hey Gerard,
Spark Streaming should just queue the processing and not delete the block
data. There are reports of this error and I am still unable to reproduce
the problem. One workaround you can try is the configuration
spark.streaming.unpersist = false. This stops Spark Streaming from
cleaning up
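(A minimal sketch of setting that workaround when building the streaming context; the app name and batch interval are placeholders.)

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf()
    .setAppName("my-streaming-app")
    .set("spark.streaming.unpersist", "false")   // keep block data instead of eagerly cleaning it up
  val ssc = new StreamingContext(conf, Seconds(10))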
OK, I didn't explain myself correctly:
In the case of Java, when you have a lot of classes, a jar should be used.
All the examples for PySpark I found are a single .py script (Pi, wordcount...),
but in a real environment analytics has more than one .py file.
My question is how to use PySpark on YARN analytics in
I need to use a broadcast variable inside the Decoder I use for class parameter
T in org.apache.spark.streaming.kafka.KafkaUtils.createStream. I am using the
override with this signature:
Hi Oleg,
We do support serving python files in zips. If you use --py-files, you can
provide a comma delimited list of zips instead of python files. This will
allow you to automatically add these files to the python path on the
executors without you having to manually copy them to every single
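(For example, something along these lines; the file names are placeholders.)

  ./bin/spark-submit --master yarn-client --py-files deps.zip main.py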
On Fri, Sep 5, 2014 at 10:50 AM, Davies Liu dav...@databricks.com wrote:
In daily development, it's common to modify your projects and re-run
the jobs. If using zip or egg to package your code, you need to do
this every time after modification, I think it will be boring.
That's why shell
Hi Grzegorz,
Can you verify that there are APPLICATION_COMPLETE files in the event log
directories? E.g. Does
file:/tmp/spark-events/app-name-1234567890/APPLICATION_COMPLETE exist? If
not, it could be that your application didn't call sc.stop(), so the
ApplicationEnd event is not actually logged.
Sure, you can request it by filing an issue here:
https://issues.apache.org/jira/browse/SPARK
2014-09-05 6:50 GMT-07:00 Ruebenacker, Oliver A
oliver.ruebenac...@altisource.com:
Hello,
Thanks for the explanation. So events are stored internally as JSON, but
there is no official
I am not sure if there is a good, clean way to do that - broadcast
variables are not designed to be used outside Spark job closures. You
could try a bit of hacky stuff where you write the serialized variable to a
file in HDFS / NFS / a distributed file system, and then use a custom decoder
class
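(A rough sketch of that hack, assuming a small Map[String, String] needs to be visible to the decoder; the HDFS path is a placeholder.)

  import java.io.{ObjectInputStream, ObjectOutputStream}
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}

  object DecoderState {
    private val path = new Path("hdfs:///tmp/decoder-state.bin")

    // Driver side: write the value once, before the stream is started.
    def save(value: Map[String, String]): Unit = {
      val out = new ObjectOutputStream(FileSystem.get(new Configuration()).create(path, true))
      try out.writeObject(value) finally out.close()
    }

    // Executor side: a custom Decoder can read it back in its constructor.
    def load(): Map[String, String] = {
      val in = new ObjectInputStream(FileSystem.get(new Configuration()).open(path))
      try in.readObject().asInstanceOf[Map[String, String]] finally in.close()
    }
  }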
At 2014-09-05 21:40:51 +0800, marylucy qaz163wsx_...@hotmail.com wrote:
But when running GraphX edgeListFile, some tasks failed
error: requested array size exceeds VM limits
Try passing a higher value for minEdgePartitions when calling
GraphLoader.edgeListFile.
Ankur
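(For example, with the 1.x GraphLoader API; the path and partition count are placeholders.)

  import org.apache.spark.graphx.GraphLoader

  val graph = GraphLoader.edgeListFile(sc, "hdfs:///edges.txt", minEdgePartitions = 64)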
repost since the original msg was marked with "This post has NOT been accepted by
the mailing list yet."
I have some questions regarding the DStream batch interval:
1. If it only takes 0.5 seconds to process the batch 99% of the time, but 1% of
batches need 5 seconds to process (due to some random factor or
I'm running into a problem getting this working as well. I have spark-submit
and spark-shell working fine, but pyspark in interactive mode can't seem to
find the lzo jar:
java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not
found
This is in
good question, soumitra. it's a bit confusing.
to break TD's code down a bit:
dstream.count() is a transformation operation (returns a new DStream),
executes lazily, runs in the cluster on the underlying RDDs that come
through in that batch, and returns a new DStream with a single element
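(To make the laziness concrete, a small sketch; dstream stands for any input DStream, and print() is the output operation that triggers execution.)

  // count() only builds a new DStream; each batch of that DStream holds a single
  // element: the number of records in the corresponding input batch.
  val counts = dstream.count()
  // Nothing runs until an output operation such as print() is registered.
  counts.print()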
Here is a story about how shared storage simplifies all the things:
In Douban, we use Moose FS[1] instead of HDFS as the distributed file system,
it's POSIX compatible and can be mounted just as NFS.
We put all the data and tools and code in it, so we can access them easily on
all the machines,
Hi Davies,
On Fri, Sep 5, 2014 at 1:04 PM, Davies Liu dav...@databricks.com wrote:
In Douban, we use Moose FS[1] instead of HDFS as the distributed file system,
it's POSIX compatible and can be mounted just as NFS.
Sure, if you already have the infrastructure in place, it might be
worthwhile
I wonder if anyone has any tips for using repartition?
It seems that when you call the repartition method, the entire RDD gets
split up, shuffled, and redistributed... This is an extremely heavy task if
you have a large HDFS dataset and all you want to do is make sure your RDD
is balanced / data
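(As an aside, not from this thread: if the goal is only to reduce the number of partitions rather than to rebalance data, coalesce avoids the full shuffle that repartition always performs.)

  // repartition(n) always shuffles the whole dataset.
  val balanced = rdd.repartition(200)

  // coalesce(n) with shuffle = false merges existing partitions locally instead.
  val merged = rdd.coalesce(200, shuffle = false)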
Good explanation, Chris :)
On Fri, Sep 5, 2014 at 12:42 PM, Chris Fregly ch...@fregly.com wrote:
good question, soumitra. it's a bit confusing.
to break TD's code down a bit:
dstream.count() is a transformation operation (returns a new DStream),
executes lazily, runs in the cluster on
I think that should be possible. Make sure spark is installed on your local
machine and is the same version as on the cluster.
On 9/5/2014 3:27 PM, anthonyjschu...@gmail.com wrote:
I think that should be possible. Make sure spark is installed on your local
machine and is the same version as on the cluster.
It is the same version, I can telnet to master:7077 but when I run the
spark-shell it times out.
You are right, but unfortunately the size of the matrices that I am trying to
build will exceed the capacity of the machines that I have access
to, which is why I need sparse libraries.
I wish there were a good library for sparse matrices in Java; I will try to
check the Scala one you mentioned.
the command should be spark-shell --master spark://<master ip on EC2>:7077.
That is the command I ran and it still times out. Besides 7077, is there
any other port that needs to be open?
Thanks!
Ognen
On 9/5/2014 4:10 PM, qihong wrote:
the command should be spark-shell --master spark://<master ip on EC2>:7077.
Since you are using your home computer, it's probably not reachable by EC2
from the internet.
You can try to set spark.driver.host to your WAN IP and spark.driver.port
to a fixed port in SparkConf, and open that port in your home network (port
forwarding to the computer you are using). See if that
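(A minimal sketch of those settings; the IP and port are placeholders.)

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .setMaster("spark://<master ip on EC2>:7077")
    .set("spark.driver.host", "<your WAN ip>")   // an address the EC2 nodes can reach
    .set("spark.driver.port", "51000")           // fixed port to forward in the home router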
Ah. So there is some kind of a back and forth going on. Thanks!
Ognen
On 9/5/2014 5:34 PM, qihong wrote:
Since you are using your home computer, it's probably not reachable by EC2
from the internet.
You can try to set spark.driver.host to your WAN IP and spark.driver.port
to a fixed port in
Hey - I’m struggling with some dependency issues with org.apache.httpcomponents
httpcore and httpclient when using spark-submit with YARN running Spark 1.0.2
on a Hadoop 2.2 cluster. I’ve seen several posts about this issue, but no
resolution.
The error message is this:
Caused by:
I set it to 200; it still fails in the second step (map and mapPartitions in the web UI).
In the Spark 1.0.2 stable version it works well in the first step, with the same configuration as
1.1.0
Hi Reza,
Have you compared the brute-force algorithm for similarity computation
with something like the following in Spark?
https://github.com/echen/scaldingale
I am adding cosine similarity computation but I do want to compute all-pairs
similarities...
Note that the data is sparse for
Hi Deb,
We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
https://github.com/apache/spark/pull/1778
Your question wasn't entirely clear - does this answer it?
Best,
Reza
On Fri, Sep 5, 2014 at 6:14 PM, Debasish Das debasish.da...@gmail.com
wrote:
Hi Reza,
Have you
Ohh cool... all-pairs brute force is also part of this PR? Let me pull it
in and test on our dataset...
Thanks.
Deb
On Fri, Sep 5, 2014 at 7:40 PM, Reza Zadeh r...@databricks.com wrote:
Hi Deb,
We are adding all-pairs and thresholded all-pairs via dimsum in this PR:
You might want to wait until Wednesday since the interface will be changing
in that PR before Wednesday, probably over the weekend, so that you don't
have to redo your code. Your call if you need it sooner than that.
Reza
On Fri, Sep 5, 2014 at 7:43 PM, Debasish Das debasish.da...@gmail.com
wrote:
Ok... just to make sure: I have a RowMatrix[SparseVector] where rows are ~60M
and columns are ~10M, say, with a billion data points...
I have another version that's around 60M and ~10K...
I guess for the second one both all-pairs and dimsum will run fine...
But for tall and wide, what do you suggest?
Also for tall and wide (rows ~60M, columns 10M), I am considering running a
matrix factorization to reduce the dimension to, say, ~60M x 50 and then run
all-pairs similarity...
Did you also try similar ideas and see positive results?
On Fri, Sep 5, 2014 at 7:54 PM, Debasish Das
For 60M x 10K brute force and dimsum thresholding should be fine.
For 60M x 10M probably brute force won't work depending on the cluster's
power, and dimsum thresholding should work with appropriate threshold.
Dimensionality reduction should help, and how effective it is will depend
on your
I looked at the code: similarColumns(Double.PositiveInfinity) is generating the brute
force...
Basically, will dimsum with gamma as PositiveInfinity produce the exact same
result as doing cartesian products of RDD[(product, vector)] and computing
similarities, or will there be some approximation?
Sorry I
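(For reference, a minimal sketch of the brute-force cartesian approach mentioned above; the tiny input is made up, and this is quadratic in the number of products.)

  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  val products = sc.parallelize(Seq(
    (1L, Vectors.dense(1.0, 0.0, 2.0)),
    (2L, Vectors.dense(0.0, 3.0, 1.0)),
    (3L, Vectors.dense(1.0, 1.0, 0.0))
  ))

  def dot(a: Vector, b: Vector): Double =
    a.toArray.zip(b.toArray).map { case (x, y) => x * y }.sum

  val similarities = products.cartesian(products)
    .filter { case ((i, _), (j, _)) => i < j }   // keep each unordered pair once
    .map { case ((i, va), (j, vb)) =>
      ((i, j), dot(va, vb) / (math.sqrt(dot(va, va)) * math.sqrt(dot(vb, vb))))
    }
  similarities.collect().foreach(println)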
Yes you're right, calling dimsum with gamma as PositiveInfinity turns it
into the usual brute force algorithm for cosine similarity, there is no
sampling. This is by design.
On Fri, Sep 5, 2014 at 8:20 PM, Debasish Das debasish.da...@gmail.com
wrote:
I looked at the code:
Awesome... Let me try it out...
Any plans of adding other similarity measures in the future (Jaccard is
something that will be useful)? I guess it makes sense to add some
similarity measures to MLlib...
On Fri, Sep 5, 2014 at 8:55 PM, Reza Zadeh r...@databricks.com wrote:
Yes you're right,
I will add dice, overlap, and jaccard similarity in a future PR, probably
still for 1.2
On Fri, Sep 5, 2014 at 9:15 PM, Debasish Das debasish.da...@gmail.com
wrote:
Awesome...Let me try it out...
Any plans of putting other similarity measures in future (jaccard is
something that will be
Hi,
I have an input file which consists of src_node dest_node
I have created an RDD consisting of key-value pairs where the key is the node
id and the values are the children of that node.
Now I want to associate a byte with each node. For that I have created a
byte array.
Every time I print out the