Hi,
I want to use Spark to analyze source code :)
Since code has dependencies between lines, it's not possible to just treat
it as independent lines. So I am considering providing my own data source
for source code, but there isn't much documentation about the data source
API. Where can I learn how to do this?
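For anyone in the same spot, here is a minimal sketch of the Spark 1.4 external data source API exposing one whole file per row, so cross-line dependencies stay together (class and column names here are made up for illustration; the real interfaces live in org.apache.spark.sql.sources):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new SourceCodeRelation(parameters("path"), sqlContext)
}

class SourceCodeRelation(path: String,
                         override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  // one row per source file: (file name, full file content)
  override def schema: StructType = StructType(Seq(
    StructField("file", StringType, nullable = false),
    StructField("content", StringType, nullable = false)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.wholeTextFiles(path).map { case (f, c) => Row(f, c) }
}

Assuming the classes are packaged under my.pkg, it would then load with sqlContext.read.format("my.pkg").load("src/").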
On 24 Jun 2015, at 05:55, canan chen
ccn...@gmail.com wrote:
Why do you want it to wait to start until all the resources are ready? Starting
it as early as possible should make it complete earlier and increase the
utilization of resources.
On Tue, Jun 23, 2015 at 10:34 PM, Arun
Hi,
I've opened a bug on the Spark project JIRA:
https://issues.apache.org/jira/browse/SPARK-8557
I would really appreciate some insights on the issue.
*It's strange that no one else has encountered the problem.*
Have a great day!
On Mon, Jun 15, 2015 at 12:03 PM, nizang ni...@windward.eu
Model sizes are in the 10m x rank and 100k x rank range.
For recommendation/topic modeling I can run a batch recommendAll and then
keep serving the model from a distributed cache, but then I can't
incorporate per-user re-prediction when user feedback is making the
current top-k stale. I have to wait for the next
Depending on the size of the memory you have, you could allocate 60-80%
of it to the Spark worker process. The datanode doesn't require too
much memory.
On 23 Jun 2015 21:26, maxdml max...@cs.duke.edu wrote:
I'm wondering if there is a real benefit for splitting my memory in two for
Hi,
I am using Spark 1.4. I wanted to serialize with KryoSerializer, but got a
ClassNotFoundException. The configuration and exception are below. When I
submitted the job, I also provided --jars mylib.jar, which contains
WRFVariableZ.
conf.set("spark.serializer",
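The configuration line above was cut off in the archive; for illustration, a sketch of the setup being described, where WRFVariableZ stands in for the poster's own class from mylib.jar (a common workaround for this kind of ClassNotFoundException is putting the jar on the executor classpath explicitly):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // register by name so the class is not needed at compile time
  .set("spark.kryo.classesToRegister", "WRFVariableZ")
  // a jar passed with --jars may not be visible when Kryo initializes, so
  // one common workaround is putting it on the executor classpath too
  .set("spark.executor.extraClassPath", "mylib.jar")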
Ok
My view is that with only 100k items, you are better off serving the item
vectors in memory: store all item vectors in memory, and compute the user *
item score on demand. In most applications only a small proportion of users
are active, so you really don't need all 10m user vectors in memory.
On Wed, Jun 24, 2015 at 12:02 PM, Nick Pentreath
nick.pentre...@gmail.com wrote:
Oryx does almost the same, but Oryx1 kept all user and item vectors in memory
(though I am not sure whether Oryx2 still stores all user and item
vectors in memory or partitions them in some way).
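As a concrete sketch of that serving approach (names and types are assumed, nothing here is from the original thread): scoring one user on demand against ~100k in-memory item vectors looks roughly like

def topK(userVec: Array[Double],
         itemVecs: Map[Long, Array[Double]],
         k: Int): Seq[(Long, Double)] = {
  // plain dot product of two dense factor vectors
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }
  itemVecs.toSeq
    .map { case (id, v) => (id, dot(userVec, v)) }
    .sortBy(-_._2)
    .take(k)
}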
(Yes, this is a
Hi,
I was wondering if it is possible to use MLlib functions inside SparkR, as
outlined at the Spark Summit East 2015 Warmup meetup:
http://www.meetup.com/Spark-NYC/events/220850389/
Are there available examples?
Thank you!
Elena
Ok, so you are running Spark in standalone mode then.
Then for every Worker process on every node (you can run more than one Worker
per node) you will have an Executor waiting for jobs ….
As far as I am concerned I think there are only two ways to achieve what you
need:
1.
Hello,
Trying to write data from Spark to Cassandra.
Reading data from Cassandra is ok, but writing seems to give a strange
error.
Exception in thread "main" scala.ScalaReflectionException: <none> is not a
term
at scala.reflect.api.Symbols$SymbolApi$class.asTerm(Symbols.scala:259)
The
hi all,
There are two examples: one throws "Task not serializable" when executed in
the spark shell, the other one is OK. I am very puzzled. Can anyone explain
what's different about these two pieces of code and why the other one is OK?
1. The one which throws "Task not serializable":
import org.apache.spark._
import
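The two original snippets were cut off in the archive; for reference, here is a made-up pair that reproduces the usual cause of this error, a closure capturing a non-serializable enclosing object:

import org.apache.spark.SparkContext

class Pipeline(sc: SparkContext) { // not Serializable; SparkContext never is
  val factor = 2
  // Throws "Task not serializable": `factor` is a field, so the closure
  // captures `this`, i.e. the whole non-serializable Pipeline.
  def bad() = sc.parallelize(1 to 10).map(_ * factor)
  // Works: copying the field to a local first means only an Int is captured.
  def good() = { val f = factor; sc.parallelize(1 to 10).map(_ * f) }
}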
Ok, thanks. I have 1 worker process on each machine but I would like to run
my app on only 3 of them. Is it possible?
On Wed, 24.06.2015 at 11:44, Evo Eftimov evo.efti...@isecc.com
wrote:
There is no direct one to one mapping between Executor and Node
Executor is simply the spark
Hello,
Thanks for all the help on resolving this issue, especially to Cody who
guided me to the solution.
For other facing similar issues, basically the issue was that I was running
Spark Streaming jobs from the spark-shell and this is not supported. Running
the same job through spark-submit
There is no direct one-to-one mapping between Executor and Node.
An Executor is simply the Spark framework's term for a JVM instance with some
Spark framework system code running in it.
A node is a physical server machine.
You can have more than one JVM per node.
And vice versa, you can
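A sketch of one lever for the original question (running the app on only 3 of the workers); the values below are made up, and standalone's default spread-out placement may still use more workers unless spark.deploy.spreadOut is disabled on the master:

val conf = new org.apache.spark.SparkConf()
  .set("spark.cores.max", "12") // e.g. 3 workers x 4 cores each (assumed sizes)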
When reading large (and many) datasets with the Spark 1.4.0 DataFrames
parquet reader (the org.apache.spark.sql.parquet format), the following
exceptions are thrown:
Exception in thread task-result-getter-0
Exception: java.lang.OutOfMemoryError thrown from the
UncaughtExceptionHandler in thread
Hi All,
I am using the new Apache Spark 1.4.0 DataFrames API to extract
information from Twitter's Status JSON, mostly focused on the Entities object
https://dev.twitter.com/overview/api/entities - the relevant part to this
question is shown below:
{
...
...
"entities": {
I finally got back to this and I just wanted to let anyone that runs into
this know that the problem is a kryo version issue. Spark (at least 1.4.0)
depends on Kryo 2.21 while my client had 2.24.0 on the classpath. Changing
it to 2.21 fixed the problem.
What that did was run a repartition with 174 tasks
AND
the actual .filter.map stage with 500 tasks.
It actually doubled the stages.
On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Hi Deepak,
Parallelism is controlled by the
Yes, it will introduce a shuffle stage in order to perform the repartitioning.
So it’s more useful if you’re planning to do many downstream transformations
for which you need the increased parallelism.
Is this a dataset from HDFS?
From: ÐΞ€ρ@Ҝ (๏̯͡๏)
Date: Wednesday, June 24, 2015 at 6:11 PM
What kind of custom logic?
On 25 Jun 2015 01:33, Ashish Soni asoni.le...@gmail.com wrote:
Hi All,
We are looking to use Spark as our stream processing framework, and it
would be helpful if experts could weigh in on whether we made the right
choice given the requirement below.
Given a stream of data we need to
Hi Burak,
Thanks for your quick reply. I guess what confuses me is that the accumulator
won't be updated until an action is used, due to laziness, so a
transformation such as a map won't even update the accumulator; then how
would a restarted transformation end up updating the accumulator more than
Hello,
I'd like to understand how other people have been aggregating metrics
using Spark Streaming and a Cassandra database. Currently I have designed
some data models that will store the rolled-up metrics. There are two
models that I am considering:
CREATE TABLE rollup_using_counters (
Any custom script (Python, Java, or Scala).
Thanks ,
Ashish
On Jun 24, 2015, at 4:39 PM, ayan guha guha.a...@gmail.com wrote:
What kind of custom logic?
On 25 Jun 2015 01:33, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
We are looking to use spark as our stream processing
Does anyone know whether SparkR supports map and reduce operations as RDD
transformations? Thanks in advance.
Best,
Wei
Hi Burak,
That makes sense; it boils down to whatever action happens after the
transformations, then. Thanks for your answers.
Best,
Wei
2015-06-24 15:06 GMT-07:00 Burak Yavuz brk...@gmail.com:
Hi Wei,
During the action, all the transformations before it will occur in order
leading up to the action.
Hi Wei,
For example, when a straggler executor gets killed in the middle of a map
operation and its task is restarted on a different instance, the
accumulator will be updated more than once.
Best,
Burak
On Wed, Jun 24, 2015 at 1:08 PM, Wei Zhou zhweisop...@gmail.com wrote:
Quoting from Spark
Hi Wei,
During the action, all the transformations before it will occur in order
leading up to the action. If you have an accumulator in any of these
transformations, then you won't get exactly once semantics, because the
transformation may be restarted elsewhere.
Best,
Burak
On Wed, Jun 24,
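A small sketch of the semantics Burak describes, using the 1.4 accumulator API (variable names made up):

val acc = sc.accumulator(0)
val data = sc.parallelize(1 to 100)
// inside a transformation: may be applied more than once if a task is re-run
data.map { x => acc += 1; x }.count()
// inside an action: each task's update is applied exactly once, even on retry
data.foreach { _ => acc += 1 }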
When my 22M Parquet test file ended up taking 3G when cached in-memory I
looked closer at how column compression works in 1.4.0. My test dataset was
1,000 columns * 800,000 rows of mostly empty floating point columns with a
few dense long columns.
I was surprised to see that no real
Have you considered instead using the mllib SparseVector type (which is
supported in Spark SQL)?
On Wed, Jun 24, 2015 at 1:31 PM, Nikita Dolgov n...@beckon.com wrote:
When my 22M Parquet test file ended up taking 3G when cached in-memory I
looked closer at how column compression works in
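For illustration, a sketch of what that suggestion looks like: one sparse vector column in place of ~1,000 mostly-empty columns (the size, indices, and values here are made up):

import org.apache.spark.mllib.linalg.Vectors

// 1,000 slots, with only positions 3 and 712 populated
val v = Vectors.sparse(1000, Array(3, 712), Array(0.5, 1.25))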
Hi all,
I'm trying to run kmeans.py Spark example on Yarn cluster mode. I'm using
Spark 1.4.0.
I'm passing numpy-1.9.2.zip with --py-files flag.
Here is the command I'm trying to execute but it fails:
./bin/spark-submit --master yarn-cluster --verbose --py-files
I have a simple HQL query (below). In Hive it takes maybe 10 minutes to
complete; when I do this with Spark it seems to take forever. The table is
partitioned by datestamp. I am using Spark 1.3.1.
How can I tune/optimize this?
Here is the query:
tumblruser = hiveCtx.sql("select s_mobile_id, receive_time
Hi Marcelo,
The issue does not happen while connecting to the Hive metastore; that works
fine. It seems that HiveContext only uses the Hive CLI to execute the queries,
while HiveServer2 does not support it. I don't think you can specify any
configuration in hive-site.xml which can make it connect to
Can you look a bit more into the error logs? It could be getting killed
because of OOM etc. One thing you can try is to set
spark.shuffle.blockTransferService to nio instead of netty.
Thanks
Best Regards
On Wed, Jun 24, 2015 at 5:46 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have a Spark job
Can you add those jars to the SPARK_CLASSPATH and give it a try?
Thanks
Best Regards
On Wed, Jun 24, 2015 at 12:07 AM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Hi folks, I have been using Spark against an external Metastore service
which runs Hive with Cdh 4.6
In Spark 1.2, I was
A screenshot of your framework running would also be helpful. How many
cores does it have?
Did you try running it in coarse-grained mode?
Try adding these to the conf:
sparkConf.set("spark.mesos.coarse", "true")
sparkConf.set("spark.cores.max", "2")
Thanks
Best Regards
On Wed, Jun 24, 2015 at 1:35 AM,
How does Spark guarantee that no RDD will fail / be lost during its life cycle?
Is there something like ack in Storm, or does it do this by default?
--
Thanks Regards,
Anshu Shukla
Michael,
I have two DataFrames: a users DF and an investments DF. The
investments DF has a column that matches the user's id. I would like to
nest the collection of investments for each user and save it to a Parquet
file. Is there a straightforward way to do this?
Thanks.
Richard Catlin
On
Can you elaborate a little more? Are you talking about a receiver or streaming?
On 24 Jun 2015 23:18, anshu shukla anshushuk...@gmail.com wrote:
How does Spark guarantee that no RDD will fail / be lost during its life cycle?
Is there something like ack in Storm, or does it do this by default?
--
Thanks
You didn't provide any error.
You're compiling against Hive 1.1 here, and that is the problem. It has
nothing to do with CDH.
On Wed, Jun 24, 2015, 10:15 PM Aaron aarongm...@gmail.com wrote:
I was curious if anyone was able to get CDH 5.4.1 or 5.4.2 compiling with
the v1.4.0 tag out of git?
Yeah, sorry, I didn't really answer your question due to my bias for EMR. =P
Unfortunately, also due to my bias, I have not actually tried using a straight
EC2 cluster as opposed to Spark on EMR. I'm not sure how the slave nodes get
set up when running Spark on EC2, but I would imagine that it
Hi,
here is the stack trace, thanks for any help:
15/06/24 23:15:26 INFO DAGScheduler: Submitting ShuffleMapStage 19
(MapPartitionsRDD[49] at treeAggregate at LBFGS.scala:218), which has no
missing parents
[error] (dag-scheduler-event-loop) java.lang.OutOfMemoryError: Requested array
size exceeds
Just curious, would you be able to use Spark on EMR rather than on EC2?
Spark on EMR will handle lost nodes for you, and it will let you scale
your cluster up and down or clone a cluster (its config, that is, not the
data stored in HDFS), among other things. We also recently announced
official
That worked. Thanks!
I wonder what changed in 1.4 to cause this. It wouldn't work with anything
less than 256m for a simple piece of code;
1.3.1 used to work with the default (64m, I think).
Srikanth
On Wed, Jun 24, 2015 at 12:47 PM, Roberto Coluccio
roberto.coluc...@gmail.com wrote:
Did you try to
Yes, these datasets are in HDFS.
Earlier that task completed in 25 mins.
Now it's 15 + 20.
On Wed, Jun 24, 2015 at 3:16 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Yes, it will introduce a shuffle stage in order to perform the
repartitioning. So it’s more useful if you’re
Hi,
According to the Spark UI, one worker was lost after a failed job. It is not
a lost-executor error; rather, the UI now only shows 8 workers (I have 9
workers). However, the EC2 console shows the machine is running
with no check alarms. So I am confused how I could reconnect the lost
Hi Jonathan,
Thanks for this information! I will take a look into it. However, is there a
way to reconnect the lost node? Or is there no way to get the lost worker
back?
Thanks!
Anny
On Wed, Jun 24, 2015 at 6:06 PM, Kelly, Jonathan jonat...@amazon.com
wrote:
Just curious, would
I have two DataFrames: a users DF and an investments DF. The
investments DF has a column that matches the user's id. I would like to
nest the collection of investments for each user and save it to a Parquet
file. Is there a straightforward way to do this?
Thanks.
Richard Catlin
The function accepted by explode is f: Row => TraversableOnce[A]. It seems
user_mentions is an array of structs. So, can you change your
pattern matching to the following?
case Row(rows: Seq[_]) => rows.asInstanceOf[Seq[Row]].map(elem => ...)
On Wed, Jun 24, 2015 at 5:27 AM, Gustavo Arjones
Thanks Nick, Sean for the great suggestions...
Since you guys have already hit these issues before, I think it would be
great if we could add those learnings to Spark Job Server and enhance it for
the community.
Nick, do you see any major issues in using Spray over Scalatra?
Looks like Model Server API
I can't tell immediately, but you might be able to get more info with the
hint provided here:
http://stackoverflow.com/questions/27980781/spark-task-not-serializable-with-simple-accumulator
(short version, set -Dsun.io.serialization.extendedDebugInfo=true)
Also, unless you're simplifying your
Hello,
I moved from 1.3.1 to 1.4.0 and started receiving
java.lang.OutOfMemoryError: PermGen space when I use spark-shell.
The same Scala code works fine in the 1.3.1 spark-shell. I was loading the
same set of external JARs and have the same imports as in 1.3.1.
I tried increasing perm size to 256m. I still got
Hi,
It must be something very straightforward...
Not working:
parallelize(sc)
Error: could not find function "parallelize"
Working:
df <- createDataFrame(sqlContext, localDF)
What did I miss?
Thanks
As long as the logic can be run in parallel, yes. You should not, however,
run any logic in the driver; all logic should run in the executors.
On 25 Jun 2015 07:58, asoni.le...@gmail.com wrote:
Any custom script (Python, Java, or Scala).
Thanks ,
Ashish
On Jun 24, 2015, at 4:39 PM, ayan guha
Thanks.
I am talking about streaming.
On 25 Jun 2015 5:37 am, ayan guha guha.a...@gmail.com wrote:
Can you elaborate little more? Are you talking about receiver or streaming?
On 24 Jun 2015 23:18, anshu shukla anshushuk...@gmail.com wrote:
How does Spark guarantee that no RDD will fail / be lost
Did you try to pass
--driver-java-options "-XX:MaxPermSize=256m"
as a spark-shell input argument?
Roberto
On Wed, Jun 24, 2015 at 5:57 PM, stati srikanth...@gmail.com wrote:
Hello,
I moved from 1.3.1 to 1.4.0 and started receiving
java.lang.OutOfMemoryError: PermGen space when I
Hi,
I have just uploaded a Spark package for datetime expressions:
https://github.com/SparklineData/spark-datetime. It exposes functions on
DateTime, Period arithmetic, and Intervals in SQL, and provides a simple DSL
to build Catalyst expressions about datetimes. A date StringContext lets you
embed
Hi,
Colleagues and I have found that the PageRank implementation bundled
with Spark is incorrect in several ways. The code in question is in
Apache Spark 1.2 distribution's examples directory, called
SparkPageRank.scala.
Consider the example graph presented in the colorful figure on the
I am trying to debug a Spark application running in a clustered/distributed
environment from Eclipse, but I am not able to succeed. The application is
Java-based and I am running it through Eclipse. Configuration for the Spark
master/workers is provided through Java only.
Though I can debug the code on the driver
Hi Terence,
Which implementation are you using? I tested it, and the results look very
good:
id --- result value (percentage) --- percentage (wikipedia)
2: 3.5658816369034536 (38.43986817970977 %), 38.4%
3: 3.1809909923039688 (34.29078328331496 %), 34.3%
5: 0.7503491964913347
Hi Spark users,
I am new to Spark, so forgive me for asking a basic question. I'm trying to
import my TSV file into Spark. This file has a key and a value separated by
a \t on each line. I want to import this file as a dictionary of key-value
pairs in Spark.
I came across this code to do the same for CSV
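A minimal sketch of the TSV case (written in Scala here; the path is illustrative, and collecting to a local map only makes sense if the file is small):

val pairs = sc.textFile("data.tsv").map { line =>
  val Array(k, v) = line.split("\t", 2) // split on the first tab only
  (k, v)
}
val dict = pairs.collectAsMap() // a local Map, i.e. the "dictionary"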
Hi All,
We are looking to use Spark as our stream processing framework, and it would
be helpful if experts could weigh in on whether we made the right choice
given the requirement below.
Given a stream of data we need to take those events through multiple stages
(pipeline processing), and in those stages the customer will
Hi Arun,
You can achieve this by
setting spark.scheduler.maxRegisteredResourcesWaitingTime to some really
high number and spark.scheduler.minRegisteredResourcesRatio to 1.0.
-Sandy
On Wed, Jun 24, 2015 at 2:21 AM, Steve Loughran ste...@hortonworks.com
wrote:
On 24 Jun 2015, at 05:55, canan
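For reference, a sketch of Sandy's suggestion as configuration (the wait value is illustrative; in Spark 1.4 this property is given in milliseconds):

conf.set("spark.scheduler.minRegisteredResourcesRatio", "1.0")
conf.set("spark.scheduler.maxRegisteredResourcesWaitingTime", "3600000")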
Basically, here's a dump of the SO question I opened
(http://stackoverflow.com/questions/31033724/spark-1-4-0-java-lang-nosuchmethoderror-com-google-common-base-stopwatch-elapse)
I'm using spark 1.4.0 and when running the Scala SparkPageRank example
It's running now.
On Wed, Jun 24, 2015 at 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Now running with
*--num-executors 9973 --driver-memory 14g --driver-java-options
-XX:MaxPermSize=512M -Xmx4096M -Xms4096M --executor-memory 14g
--executor-cores 1*
On Wed, Jun 24, 2015 at 10:34
Hi Michael,
Spark itself is an execution engine, not a storage system. While it has
facilities for caching data in memory, think about these the way you would
think about a process on a single machine leveraging memory - the source
data needs to be stored somewhere, and you need to be able to
Cool. :)
On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
It's running now.
On Wed, Jun 24, 2015 at 10:45 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
Now running with
*--num-executors 9973 --driver-memory 14g --driver-java-options
-XX:MaxPermSize=512M -Xmx4096M -Xms4096M
I see this
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.util.Arrays.copyOfRange(Arrays.java:2694)
at java.lang.String.<init>(String.java:203)
at java.lang.StringBuilder.toString(StringBuilder.java:405)
at java.io.UnixFileSystem.resolve(UnixFileSystem.java:108)
at
There are multiple of these
1)
15/06/24 09:53:37 ERROR executor.Executor: Exception in task 443.0 in stage
3.0 (TID 1767)
java.lang.OutOfMemoryError: GC overhead limit exceeded
at
sun.reflect.GeneratedSerializationConstructorAccessor1327.newInstance(Unknown
Source)
at
I have a filter.map that triggers 170 tasks. How can I increase that?
Code:
val viEvents = details.filter(_.get(14).asInstanceOf[Long] != NULL_VALUE).map
{ vi => (vi.get(14).asInstanceOf[Long], vi) }
Deepak
Now running with
*--num-executors 9973 --driver-memory 14g --driver-java-options
-XX:MaxPermSize=512M -Xmx4096M -Xms4096M --executor-memory 14g
--executor-cores 1*
On Wed, Jun 24, 2015 at 10:34 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
There are multiple of these
1)
15/06/24 09:53:37
I have a spark job that's running on a 10 node cluster and the python
process on all the nodes is pegged at 100%.
I was wondering what parts of a spark script are run in the python process
and which get passed to the Java processes? Is there any documentation on
this?
Thanks,
Justin
CDH version: 5.3
Spark Version: 1.2
I was trying to execute a Hive query from Spark code (using the HiveContext
class). It was working fine until we installed Apache Sentry. Now it's
giving me a read permission exception:
org.apache.hadoop.security.AccessControlException: Permission denied:
Thanks, that did seem to make a difference. I am a bit scared of this
approach, as Spark itself has a different Guava dependency, but the error
does go away this way.
On Wed, Jun 24, 2015 at 10:04 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Can you try to add those jars in the SPARK_CLASSPATH
Starting in Spark 1.4 there is also an explode that you can use directly
from the select clause (much like in HiveQL):
import org.apache.spark.sql.functions._
df.select(explode($"entities.user_mentions").as("mention"))
Unlike standard HiveQL, you can also include other attributes in the select
or
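As a hedged usage sketch of that point (the id column is an assumption about the poster's schema; screen_name is a standard field of Twitter's user_mentions entities):

import org.apache.spark.sql.functions._
import sqlContext.implicits._

// other attributes ride along with each exploded mention
df.select($"id", explode($"entities.user_mentions").as("mention"))
  .select($"id", $"mention.screen_name")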
It's taking an hour, and on Hadoop it takes 1h 30m; is there a way to make it
run faster?
On Wed, Jun 24, 2015 at 11:39 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Cool. :)
On 24 Jun 2015 23:44, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
It's running now.
On Wed, Jun 24, 2015 at 10:45 AM,
I was curious if anyone was able to get CDH 5.4.1 or 5.4.2 compiling with
the v1.4.0 tag out of git? SparkSQL keeps dying on me and I'm not 100% sure
why.
I modified the pom.xml to make a simple profile to help:
<profile>
  <id>cdh542</id>
  <properties>
    <java.version>1.7</java.version>
Hi Deepak,
Parallelism is controlled by the number of partitions. In this case, how many
partitions are there for the details RDD (likely 170)?
You can check by running “details.partitions.length”. If you want to increase
parallelism you can do so by repartitioning, increasing the number of
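A quick sketch of that check and fix, using the names from the thread:

println(details.partitions.length)        // likely 170 here
val details500 = details.repartition(500) // adds a shuffle, but 500 tasks after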
Hello Spark Experts,
I am facing the following issue.
1) I am converting an org.apache.spark.sql.Row into an
org.apache.spark.mllib.linalg.Vectors using sparse notation.
2) After the parsing proceeds successfully, I try to look at the result and
I get the following error:
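The error itself was cut off in the archive. For reference, a sketch of the kind of conversion being described (the column positions and vector size are assumptions, not the poster's actual schema):

import org.apache.spark.mllib.linalg.Vectors

val vecs = df.rdd.map { row =>
  Vectors.sparse(
    1000,                              // assumed vector size
    row.getAs[Seq[Int]](0).toArray,    // assumed: column 0 holds the indices
    row.getAs[Seq[Double]](1).toArray) // assumed: column 1 holds the values
}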
Quoting from the Spark programming guide:
For accumulator updates performed inside *actions only*, Spark guarantees
that each task’s update to the accumulator will only be applied once, i.e.
restarted tasks will not update the value. In transformations, users should
be aware of that each task’s update