Hi all,
I'm trying to compare functions available in Spark 1.0 hql to original HiveQL.
But when I tested functions such as 'rank', Spark didn't support some HiveQL functions.
In the case of Shark, it supports functions as well as Hive does, so I want to convert
As simple as that. Indeed, the Spark jar I was linking to wasn't the MapR
version. I just added spark-assembly-0.9.1-hadoop1.0.3-mapr-3.0.3.jar to the
lib directory of my project as an unmanaged dependency for sbt.
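(For anyone hitting the same thing, a minimal sketch of the sbt side; jars in
lib/ are picked up as unmanaged dependencies with no configuration at all, and
the custom path below is purely illustrative:)

// build.sbt -- only needed if the unmanaged jars live somewhere other than lib/:
unmanagedBase := baseDirectory.value / "custom_lib"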
Thank you Cafe au Lait and to all of you guys.
Regards,
Nelson.
Hi, I am getting the following error but I don't understand what the problem
is.
14/05/27 17:44:29 INFO TaskSetManager: Loss was due to java.io.IOException:
Map failed [duplicate 15]
14/05/27 17:44:30 INFO TaskSetManager: Starting task 47.0:43 as TID 60281 on
executor 0: cm07 (PROCESS_LOCAL)
When I use create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001 limit 40;, there will be 4 files created on
Tachyon.
But when I use create table bigtable002 tblproperties('shark.cache'='tachyon')
as select * from bigtable001;, there will be 35
Hi,
We use Spark 0.9.1 in standalone mode.
We found that lots of app temporary files didn't get removed from each worker's
local file system even after the job finished. These folders have names
such as app-20140516120842-0203.
These files occupied so much disk storage that we had to run a daemon
Did you try the Hive Context? Look under Hive Support here:
http://people.apache.org/~pwendell/catalyst-docs/sql-programming-guide.html
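(For reference, a minimal sketch of HiveContext usage as of the 1.0 docs; the
table name and query below are made up:)

val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
import hiveContext._
// hql() runs a HiveQL string against the Hive metastore (illustrative query):
hql("SELECT key, value FROM src").collect().foreach(println)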
On Tue, May 27, 2014 at 2:09 AM, 정재부 itsjb.j...@samsung.com wrote:
Hi all,
I'm trying to compare functions available in Spark 1.0 hql to original
Any suggestion is very much appreciated.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/K-nearest-neighbors-search-in-Spark-tp6393p6421.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hi,
I am facing a weird issue. I am using Spark 0.9 and running a streaming
application.
In the UI, the duration shows on the order of seconds, but if I dig into that
particular stage's details, it shows that the total time taken across all tasks
for the stage is much, much less (in milliseconds).
I am using Fair
Hi Jamal,
One nice feature of PySpark is that you can easily use existing functions
from NumPy and SciPy inside your Spark code. For a simple example, the
following uses Spark's cartesian operation (which combines pairs of vectors
into tuples), followed by NumPy's corrcoef to compute the Pearson
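(For readers following along in Scala, a sketch of the same idea, assuming
vectors: RDD[Array[Double]]; the actual suggestion above is NumPy's corrcoef
in PySpark:)

// Pearson correlation of two equal-length vectors, computed by hand:
def pearson(x: Array[Double], y: Array[Double]): Double = {
  val n = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = (x zip y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// Pair every vector with every other, then correlate each pair:
val correlations = vectors.cartesian(vectors)
  .map { case (x, y) => pearson(x, y) }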
Hi everyone!
Any recommendation anyone?
Pierre
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Summit-2014-Hotel-suggestions-tp5457p6424.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Hello,
I've installed Spark cluster spark-0.9.0-incubating-bin-hadoop1, which
works fine. Also, on the same cluster I've installed a Mesos cluster, using
mesos_0.18.2_x86_64.rpm, which works fine as well. Now, I was trying to
follow the instructions from
I was able to work around this by switching to the SpecificDatum interface
and following this example:
https://github.com/massie/spark-parquet-example/blob/master/src/main/scala/com/zenfractal/SerializableAminoAcid.java
As in the example, I defined a subclass of my Avro type which implemented
the
I am experiencing the same issue (I tried both using Kryo as serializer and
increasing the buffer size up to 256M, my objects are much smaller though).
I share my registrator class just in case:
https://gist.github.com/JordiAranda/5cc16cf102290c413c82
Any hints would be highly appreciated.
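(For comparison, a minimal sketch of the settings involved in that era of
Spark; the registrator class name is a placeholder, not the one in the gist:)

val conf = new org.apache.spark.SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.MyRegistrator") // placeholder
  .set("spark.kryoserializer.buffer.mb", "256") // the 256M mentioned above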
Go to expedia/orbitz and look for hotels in the union square neighborhood.
In my humble opinion having visited San Francisco, it is worth any extra
cost to be as close as possible to the conference vs having to travel from
other parts of the city.
On Tue, May 27, 2014 at 9:36 AM, Gerard Maas
Hi guys,
I ended up reserving a room at the Phoenix (Hotel:
http://www.jdvhotels.com/hotels/california/san-francisco-hotels/phoenix-hotel)
recommended by my friend who has been in SF.
According to Google, it takes 11min to walk to the conference which is not
too bad.
Hope this helps!
Jerry
Thanks, Andrew. I'll give it a try.
On Mon, May 26, 2014 at 2:22 PM, Andrew Or and...@databricks.com wrote:
Hi Roger,
This was due to a bug in the Spark shell code, and is fixed in the latest
master (and RC11). Here is the commit that fixed it:
Thanks that's super helpful.
J
On Tue, May 27, 2014 at 8:01 AM, Matt Massie mas...@berkeley.edu wrote:
I really should update that blog post. I created a gist (see
https://gist.github.com/massie/7224868) which explains a cleaner, more
efficient approach.
--
Matt
Also see this context from February. We started working with Chill to get
Avro records automatically registered with Kryo. I'm not sure of the final
status, but from the Chill PR #172 it looks like this might involve much less
friction than before.
Issue we filed:
Hi Carter,
In Spark 1.0 there will be an implementation of k-means available as part
of MLlib. You can see the documentation for that below (until 1.0 is fully
released).
https://people.apache.org/~pwendell/spark-1.0.0-rc9-docs/mllib-clustering.html
Maybe diving into the source here will help
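(A minimal sketch of that MLlib API as of the 1.0 docs; the file path and
parameters are made up:)

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-separated numeric vectors from a text file (path illustrative):
val data = sc.textFile("data/kmeans_data.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))

// Cluster into k = 2 groups, with at most 20 iterations:
val model = KMeans.train(data, 2, 20)
println(model.clusterCenters.mkString(", "))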
I keep bumping into a problem with persisting RDDs. Consider this (silly)
example:
def everySecondFromBehind(input: RDD[Int]): RDD[Int] = {
  val count = input.count
  if (count % 2 == 0) {
    input.filter(_ % 2 == 1) // count even: keep the odd values
  } else {
    input.filter(_ % 2 == 0) // count odd: keep the even values
  }
}
The situation is
Thanks for the heads up, I also experienced this issue.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/file-not-found-tp1854p6438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Daniel,
Is SPARK-1103 https://issues.apache.org/jira/browse/SPARK-1103 related to
your example? Automatic unpersist()-ing of unreferenced RDDs would be nice.
Nick
On Tue, May 27, 2014 at 12:28 PM, Daniel Darabos
daniel.dara...@lynxanalytics.com wrote:
I keep bumping into a problem with
Sorry, to clarify: Spark *does* effectively turn Akka's failure detector
off.
On Tue, May 27, 2014 at 10:47 AM, Aaron Davidson ilike...@gmail.com wrote:
Spark should effectively turn Akka's failure detector off, because we
historically had problems with GCs and other issues causing
Spark should effectively turn Akka's failure detector off, because we
historically had problems with GCs and other issues causing
disassociations. The only thing that should cause these messages nowadays
is if the TCP connection (which Akka sustains between Actor Systems on
different machines)
Hi,
Spark newbie here with a general question: in a stream consisting of
several types of events, how can I detect if event X happened within Z
transactions of event Y? Is it just a matter of iterating through all the RDDs:
when event type Y is found, take the next Z transactions and check if
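(Not from the thread, but one rough sketch of that idea, assuming an ordered
stream, a hypothetical Event type, and placeholder names X, Y, and z:)

case class Event(eventType: String) // hypothetical event record

// True if some X occurs within z events after a Y. Assumes the RDD preserves
// stream order; cartesian() is quadratic and only for illustration.
def xWithinZOfY(events: org.apache.spark.rdd.RDD[Event], z: Long): Boolean = {
  val indexed = events.zipWithIndex() // (event, position)
  val yPos = indexed.filter(_._1.eventType == "Y").map(_._2)
  val xPos = indexed.filter(_._1.eventType == "X").map(_._2)
  yPos.cartesian(xPos).filter { case (y, x) => x > y && x - y <= z }.count() > 0
}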
Maybe that explains mine too.
Thank you very much, Aaron !!
Best regards,
-chanwit
--
Chanwit Kaewkasi
linkedin.com/in/chanwit
On Wed, May 28, 2014 at 12:47 AM, Aaron Davidson ilike...@gmail.com wrote:
Spark should effectively turn Akka's failure detector off, because we
historically
Hi all,
I have a single machine with 8 cores and 8g mem. I've deployed the
standalone Spark on the machine and successfully run the examples.
Now I'm trying to write some simple Java code. I just read a local file
(23M) into a list of strings and use JavaRDD<String> rdds =
sparkContext.parallelize()
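(As an aside, and in Scala rather than Java: sc.textFile reads the file as an
RDD of lines directly, without first building the whole list in memory; the
path is a placeholder:)

val lines = sc.textFile("/path/to/local/file.txt")
println(lines.count())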
To answer my own question, that does seem to be the right way. I was
concerned about whether the data in a broadcast variable would end up
getting serialized if I used it as an instance variable of the function. I
realized that doesn't happen because the broadcast variable's value is
marked as
I am running this on a Solaris machine with logical partitions. All the
partitions (workers) access the same Spark folder.
Thanks,
Suman.
On 5/23/2014 9:44 PM, Andrew Or wrote:
That means not all of your driver and executors have the same version
of Spark. Are you on a standalone EC2
On Tue, May 27, 2014 at 1:05 PM, Suman Somasundar
suman.somasun...@oracle.com wrote:
I am running this on a Solaris machine with logical partitions. All the
partitions (workers) access the same Spark folder.
Can you check whether you have multiple versions of the offending
class
Does the Spark UI show your program running? (http://spark-masterIP:8118).
If the program is listed as running you should be able to see details via
the UI. In my experience there are 3 sets of logs -- the log where you're
running your program (the driver), the log on the master node, and the log
I use both Pig and Spark. All my code is built with Maven into a giant
*-jar-with-dependencies.jar. I recently upgraded to Spark 1.0 and now
all my Pig scripts fail with:
Caused by: java.lang.RuntimeException: Could not resolve error that
occured when launching map reduce job:
Spark uses 1.7.5, and you should probably see 1.7.{4,5} in use through
Hadoop. But those are compatible.
That method appears to have been around since 1.3. What version does Pig want?
I usually do mvn -Dverbose dependency:tree to see both what the
final dependencies are, and what got
I think what's desired here is for input to be unpersisted automatically as
soon as result is materialized. I don't think there's currently a way to do
this, but the usual workaround is to force result to be materialized
immediately and then unpersist input:
input.cache()
val count =
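(Spelled out, the workaround presumably looks something like this, borrowing
names from the everySecondFromBehind example earlier in the thread:)

input.cache()
val result = everySecondFromBehind(input)
result.cache()
result.count()    // force result to materialize now
input.unpersist() // fine once result is cached; input would only be
                  // recomputed if result's partitions were later evicted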
I've got a trained MatrixFactorizationModel via ALS.train(...) and now I'm
trying to use it to predict some ratings like so:
JavaRDD<Rating> predictions = model.predict(usersProducts.rdd())
Where usersProducts is built from an existing Ratings dataset like so:
JavaPairRDD<Integer, Integer>
Hi Sandeep,
I think you should use testRatings.mapToPair instead of testRatings.map.
So the code should be:
JavaPairRDD<Integer, Integer> usersProducts = training.mapToPair(
    new PairFunction<Rating, Integer, Integer>() {
        public Tuple2<Integer, Integer> call(Rating r) {
            return new Tuple2<Integer, Integer>(r.user(), r.product());
        }
    });
Carter,
Just as a quick, simple starting point for Spark (caveats: lots of
improvements required for scaling, graceful and efficient handling of RDDs,
et al.):
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import scala.collection.immutable.ListMap
import
I'm trying to determine how to bound my memory use in a job working with
more data than can simultaneously fit in RAM. From reading the tuning
guide, my impression is that Spark's memory usage is roughly the following:
(A) in-memory RDD use + (B) in-memory shuffle use + (C) transient memory
used
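(If it helps frame the question, the knobs governing (A) and (B) around Spark
0.9/1.0 were, to my understanding, the following; the values are illustrative,
not recommendations:)

val conf = new org.apache.spark.SparkConf()
  .set("spark.storage.memoryFraction", "0.6") // cap on (A): cached RDD storage
  .set("spark.shuffle.memoryFraction", "0.2") // cap on (B): in-memory shuffle
// (C) transient task memory is whatever heap remains outside these two pools.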
I already tried HiveContext as well as SQLContext.
But it seems that Spark's HiveContext is not completely the same as Apache Hive.
For example, SQL like 'SELECT RANK() OVER(ORDER BY VAL1 ASC) FROM TEST LIMIT 10' works fine in Apache Hive,
but Spark's Hive
Hi,
Has anyone had luck going through previous archives of the AMPCamp
exercises? Many of the archived bootcamps seem to be broken because they
reference the same AMIs, which are constantly being updated, meaning they are
no longer compatible with the old bootcamp instructions or
Keith, do you mean bound as in (a) strictly control to some quantifiable
limit, or (b) try to minimize the amount used by each task?
If a, then that is outside the scope of Spark's memory management, which
you should think of as an application-level (that is, above JVM) mechanism.
In this scope,
A dash of both. I want to know enough that I can reason about, rather
than strictly control, the amount of memory Spark will use. If I have a
big data set, I want to understand how I can design it so that Spark's
memory consumption falls below my available resources. Or alternatively,
if it's