RDD#union is not the same thing as SparkContext#union
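A minimal sketch of the difference, assuming a SparkContext named sc (data is illustrative): chaining RDD#union builds one nested UnionRDD per call, while SparkContext#union combines all the RDDs in a single step.

val parts = (1 to 10).map(i => sc.parallelize(Seq(i)))
// val chained = parts.reduce(_ union _)  // one UnionRDD per call, long lineage
val combined = sc.union(parts)            // a single UnionRDD over all parts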
On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang-cs.com wrote:
Hi Noorul,
Thank you for your suggestion. I tried that, but ran out of memory. I did
some searching and found suggestions
that we should try to avoid rdd.union(
The grouping is determined by the POJO's equals() method. You can also
call groupBy() to group by some function of the POJOs. For example if
you're grouping Doubles into nearly-equal bunches, you could group by
their .intValue()
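A small illustrative sketch of that idea in Scala (the data and the integer-part key are made up, not from the original post):

val doubles = sc.parallelize(Seq(1.1, 1.9, 2.3, 2.7, 3.0))
val bunches = doubles.groupBy(d => d.toInt)  // .toInt plays the role of Java's .intValue()
bunches.collect().foreach { case (k, vs) => println(k + " -> " + vs.mkString(", ")) }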
On Thu, Mar 26, 2015 at 8:47 PM, Mihran Shahinian
Hi Mark,
That's true, but neither way lets me combine the RDDs, so I have to avoid
unions.
Thanks,
Yang
On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com
wrote:
RDD#union is not the same thing as SparkContext#union
On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen
We have Spark on YARN, with Cloudera Manager 5.3.2 and CDH 5.3.2
Jobs link on spark History server doesn't open and shows following message
:
HTTP ERROR: 500
Problem accessing /history/application_1425934191900_87572. Reason:
Server Error
It is brought in by another dependency, so you do not need to specify it
explicitly... I think this is what Ted meant.
On Fri, Mar 27, 2015 at 9:48 AM Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
+spark-dev
Yes, the dependencies are there. I guess my question is how come the build
is
bcc: user@, cc: cdh-user@
I recommend using CDH's mailing list whenever you have a problem with CDH.
That being said, you haven't provided enough info to debug the
problem. Since you're using CM, you can easily go look at the History
Server's logs and see what the underlying error is.
On Thu,
Thanks, all. I am installing Spark 1.3 now. I thought I had better keep in sync
with the rapid evolution of this new technology.
So once I install that, I will try to use the Spark-CSV library.
Regards
Ananda
From: Dean Wampler [mailto:deanwamp...@gmail.com]
Sent: Wednesday, March 25, 2015 1:17 PM
Looks like the following assertion failed:
Preconditions.checkState(storageIDsCount == locs.size());
locs is a List&lt;DatanodeInfoProto&gt;
Can you enhance the assertion to log more information ?
Cheers
On Thu, Mar 26, 2015 at 3:06 PM, Dale Johnson daljohn...@ebay.com wrote:
There seems to be a
How do I get the number of cores that I specified at the command line? I
want to use spark.default.parallelism. I have 4 executors, each has 8
cores. According to
https://spark.apache.org/docs/1.2.0/configuration.html#execution-behavior,
the spark.default.parallelism value will be 4 * 8 = 32...I
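A small sketch of how one might check this at runtime, assuming spark.executor.cores and spark.executor.instances were both set on the command line (the property names are the standard YARN ones, not taken from the original post):

val conf = sc.getConf
val requestedCores = conf.getInt("spark.executor.cores", 1) * conf.getInt("spark.executor.instances", 1)
println("requested cores = " + requestedCores + ", defaultParallelism = " + sc.defaultParallelism)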
Yes, that is the correct understanding. There are undocumented parameters
that allow that, but I do not recommend using those :)
TD
On Wed, Mar 25, 2015 at 6:57 AM, Luis Ángel Vicente Sánchez
langel.gro...@gmail.com wrote:
I have a simple and probably dumb question about foreachRDD.
We are
Hi Sandeep,
I followed the DenseKMeans example which comes with the spark package.
My total vectors are about 40k, and my k=500. All my code is written in Scala.
Thanks,
David
On Fri, 27 Mar 2015 05:51 sandeep vura sandeepv...@gmail.com wrote:
Hi Shen,
I am also working on k means
Hi Burak,
My maximum iterations is set to 500, but I think it should also stop on
centroid convergence, right?
My Spark is 1.2.0, running on Windows 64-bit. My data set is about 40k
vectors; each vector has about 300 features, all normalised. All worker nodes
have sufficient memory and disk space.
+spark-dev
Yes, the dependencies are there. I guess my question is how come the build
is succeeding in the mainline then, without adding these dependencies?
On Thu, Mar 26, 2015 at 3:44 PM, Ted Yu yuzhih...@gmail.com wrote:
Looking at output from dependency:tree, servlet-api is brought in by
Oh, the job I talked about has run for more than 11 hours without a result... it
doesn't make sense.
On Fri, Mar 27, 2015 at 9:48 AM Xi Shen davidshe...@gmail.com wrote:
Hi Burak,
My maximum iterations is set to 500, but I think it should also stop on
centroid convergence, right?
My spark is 1.2.0,
Thanks all for the quick response.
Thanks.
Zhan Zhang
On Mar 26, 2015, at 3:14 PM, Patrick Wendell pwend...@gmail.com wrote:
I think we have a version of mapPartitions that allows you to tell
Spark the partitioning is preserved:
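A hedged sketch of that variant (data is made up): mapPartitions takes a preservesPartitioning flag, so the existing partitioner is kept as long as the function does not change the keys.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i)).partitionBy(new HashPartitioner(10))
val mapped = pairs.mapPartitions(
  iter => iter.map { case (k, v) => (k, v * 2) },  // keys are left unchanged
  preservesPartitioning = true)
assert(mapped.partitioner == pairs.partitioner)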
Hi,
We are trying to build spark 1.2 from source (tip of the branch-1.2 at the
moment). I tried to build spark using the following command:
mvn -U -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive
-Phive-thriftserver -DskipTests clean package
I encountered various missing class definition
Looking at output from dependency:tree, servlet-api is brought in by the
following:
[INFO] +- org.apache.cassandra:cassandra-all:jar:1.2.6:compile
[INFO] | +- org.antlr:antlr:jar:3.2:compile
[INFO] | +- com.googlecode.json-simple:json-simple:jar:1.1:compile
[INFO] | +-
will do! I've got to clear with my boss what I can post and in what manner, but
I'll definitely do what I can to put some working code out into the world so
the next person who runs into this brick wall can benefit from all this :-D
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758
Hi all,
For my master's thesis I will be characterising the performance of two-level
schedulers like Mesos, and after reading the paper
https://www.cs.berkeley.edu/~alig/papers/mesos.pdf
where Spark is also introduced, I am wondering how some of the experiments and
results came about.
If this is not the
The code is very simple.
val data = sc.textFile("very/large/text/file") map { l =>
// turn each line into dense vector
Vectors.dense(...)
}
// the resulting data set is about 40k vectors
KMeans.train(data, k=5000, maxIterations=500)
I just killed my application. In the log I found this:
Dear all,
I am trying to upgrade Spark from 1.2 to 1.3 and switch the existing API
from creating a SchemaRDD to a DataFrame.
After testing, I notice that the following behavior is changed:
```
import java.sql.Date
import com.bridgewell.SparkTestUtils
import org.apache.spark.rdd.RDD
import
Can anyone shed some light on this?
On Tue, Mar 17, 2015 at 5:23 PM, Chen Song chen.song...@gmail.com wrote:
I have a map reduce job that reads from three logs and joins them on some
key column. The underlying data is protobuf messages in sequence
files. Between mappers and reducers, the
Creating a SparkContext and setting master as yarn-cluster unfortunately
will not work.
SPARK-4924 added APIs for doing this in Spark, but won't be included until
1.4.
-Sandy
On Tue, Mar 17, 2015 at 3:19 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Create SparkContext set master as
Using Spark 1.3.0 on CDH 5.1.0, I ran into a fetch failed exception.
I searched in this email list but not found anything like this reported.
What could be the reason for the error?
org.apache.spark.shuffle.FetchFailedException: [EMPTY_INPUT] Cannot
decompress empty stream
at
Here is the stack trace:
Hi,
Did you run the word count example in Spark local mode or another mode? In
local mode you have to set local[n], where n >= 2. For other modes, make sure
the number of available cores is larger than 1, because the receiver inside
Spark Streaming runs as a long-running task, which will occupy at least one core.
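A minimal sketch of that point, using the socket word-count example (host, port, and batch interval are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WordCount").setMaster("local[2]")  // n >= 2: one core goes to the receiver
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()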
Did you manage to connect to the Hive metastore from Spark SQL? I copied the hive
conf file into the Spark conf folder, but when I run show tables, or do select *
from dw_bid (dw_bid is stored in Hive), it says the table is not found.
On Thu, Mar 26, 2015 at 11:43 PM, Chang Lim chang...@gmail.com wrote:
Solved.
I am now seeing this error.
15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw exception:
FAILED: SemanticException Line 1:23 Invalid path
''examples/src/main/resources/kv1.txt'': No files matching path
Try to give the complete path to the file kv1.txt.
On 26 Mar 2015 11:48, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I am now seeing this error.
15/03/25 19:44:03 ERROR yarn.ApplicationMaster: User class threw
exception: FAILED: SemanticException Line 1:23 Invalid path
From a quick look at this link -
http://accumulo.apache.org/1.6/accumulo_user_manual.html#_mapreduce - it
seems you need to call some static methods on AccumuloInputFormat in order
to set the auth, table, and range settings. Try setting these config
options first and then call newAPIHadoopRDD?
On
What's your Spark version? Not quite sure, but you could be hitting this
issue https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-4516
On 26 Mar 2015 11:01, Xi Shen davidshe...@gmail.com wrote:
Hi,
My environment is Windows 64bit, Spark + YARN. I had a job that takes a
long
Try registering your MyObject[] with Kryo.
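A sketch of what that registration might look like, assuming the MyObject class from the original post:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("TestKryoSerialization")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyObject], classOf[Array[MyObject]]))  // register the element class and the array class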
On 25 Mar 2015 13:17, donhoff_h 165612...@qq.com wrote:
Hi, experts
I wrote a very simple Spark program to test the KryoSerialization
function. The code is as follows:
object TestKryoSerialization {
def main(args: Array[String]) {
val
Here's something similar which I used to do:
unionDStream.foreachRDD(rdd => {
  val events = rdd.count()
  println("Received Events : " + events)
  if (events > 0) {
    val fw = new FileWriter("events", true)
    fw.write(Calendar.getInstance().getTime + "," + events + "\n")
    fw.close()
  }
})
Sending from cellphone,
Hello DB,
Thank you! Do you know how to run Linear Regression without SGD on
streaming data in Spark? I tried SGD, but due to the step size I was not getting
the expected weights.
Best Regards,
Arunkumar
On Wed, Mar 25, 2015 at 4:33 PM, DB Tsai dbt...@dbtsai.com wrote:
Hi Arunkumar,
I think
ah~hell, I am using Spark 1.2.0, and my job was submitted to use 8
cores...the magic number in the bug.
Xi Shen
http://about.me/davidshen
On Thu, Mar 26, 2015 at 5:48 PM, Akhil Das
Specifically there are only 5 aggregate functions in class
org.apache.spark.sql.GroupedData: sum/max/min/mean/count.
Can I plug in a function to calculate stddev?
Thank you!
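One workaround sketch rather than a pluggable aggregate: derive stddev from sum, sum of squares, and count using only the built-in aggregates (df, "key", and "value" are placeholder names):

import org.apache.spark.sql.functions.{count, sum}

val grouped = df.groupBy("key").agg(
  sum(df("value")).as("s"),
  sum(df("value") * df("value")).as("ss"),
  count(df("value")).as("n"))

val stddev = grouped.map { row =>
  val (s, ss, n) = (row.getDouble(1), row.getDouble(2), row.getLong(3).toDouble)
  (row.get(0), math.sqrt(ss / n - (s / n) * (s / n)))  // population stddev per key
}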
I think it is not `sqlContext` but hiveContext because `create temporary
function` is not supported in SQLContext.
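A minimal sketch of that distinction, with a placeholder UDF class name:

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// HiveContext accepts Hive DDL such as CREATE TEMPORARY FUNCTION;
// a plain SQLContext would reject the same statement
hiveContext.sql("CREATE TEMPORARY FUNCTION my_func AS 'com.example.MyUDF'")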
On Wed, Mar 25, 2015 at 5:58 AM, Jon Chase jon.ch...@gmail.com wrote:
Shahab -
This should do the trick until Hao's changes are out:
sqlContext.sql(create temporary function
I have a Hive table named dw_bid; when I run hive from the command prompt and
run describe dw_bid, it works.
I want to join an Avro file (table) in HDFS with this Hive dw_bid table, and
I refer to it as dw_bid from the Spark SQL program; however, I see
15/03/26 00:31:01 INFO HiveMetaStore.audit:
Does not work
15/03/26 01:07:05 INFO HiveMetaStore.audit: ugi=dvasthimal
ip=unknown-ip-addr cmd=get_table : db=default tbl=src_spark
15/03/26 01:07:06 ERROR ql.Driver: FAILED: SemanticException Line 1:23
Invalid path
Resolved. The bold text is the fix.
./bin/spark-submit -v --master yarn-cluster --jars
Now it's clear that the workers do not have the file kv1.txt on their
local filesystem. You can try putting it in HDFS and using the URI to that
file, or try adding the file with sc.addFile
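A hedged sketch of both options (paths and the table name are illustrative, based on the Hive kv1.txt example):

// Option 1: put the file in HDFS first (hdfs dfs -put examples/src/main/resources/kv1.txt /tmp/kv1.txt),
// then reference it by URI from the Hive statement
hiveContext.sql("LOAD DATA INPATH 'hdfs:///tmp/kv1.txt' INTO TABLE src")

// Option 2: ship the local file to every worker
sc.addFile("examples/src/main/resources/kv1.txt")
val localPath = org.apache.spark.SparkFiles.get("kv1.txt")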
Thanks
Best Regards
On Thu, Mar 26, 2015 at 1:38 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Does not
What does show tables return? You can also run SET optionName to
make sure that entries from your hive-site are being read correctly.
On Thu, Mar 26, 2015 at 4:02 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I have a table dw_bid that was created in Hive and has nothing to do with
Spark. I have
Hi,
I’ve been trying to use Spark Streaming for my real-time analysis
application using the Kafka Stream API on a cluster (using the yarn version)
of 6 executors with 4 dedicated cores and 8192mb of dedicated RAM.
The thing is, my application should run 24/7 but the disk usage is leaking.
This
Hi.
I'm trying to trigger DataFrame's save method in parallel from my driver.
For that purpose I use ExecutorService and Futures; here's my code:
val futures = Seq(1, 2, 3).map( t => pool.submit( new Runnable {
  override def run(): Unit = {
    val commons = events.filter(_._1 == t).map(_._2.common)
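Since the snippet above is cut off, here is a self-contained sketch of the same ExecutorService/Futures pattern with placeholder work; the real filter and save logic from the post is omitted:

import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))

val futures = Seq(1, 2, 3).map { t =>
  Future {
    // each future submits an independent job on the shared SparkContext
    sc.parallelize(1 to 100).filter(_ % 3 == t % 3).count()
  }
}
futures.foreach(f => Await.ready(f, 10.minutes))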
As I wrote previously, indexing is not your only choice: you can
pre-aggregate data during load or, depending on your needs, think
about other data structures, such as graphs, HyperLogLog, Bloom filters,
etc. (a challenge to integrate into standard BI tools).
On 26 March 2015 at 13:34, kundan
How can we catch exceptions that are thrown from custom RDDs or custom map
functions?
We have a custom RDD that is throwing an exception that we would like to
catch, but the exception that is thrown back to the caller is an
*org.apache.spark.SparkException* that does not contain any useful
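The question above is cut off, but one hedged sketch of a way to keep the original cause visible is to wrap the risky work in scala.util.Try inside the task, so the failure travels back as data instead of only as a SparkException:

import scala.util.{Failure, Success, Try}

val results = sc.parallelize(Seq("1", "2", "oops")).map { s =>
  Try(s.toInt) match {
    case Success(v)  => Right(v)
    case Failure(ex) => Left(ex.getMessage)  // record-level cause preserved
  }
}
results.collect().foreach(println)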
An RDD is a very different creature than a NoSQL store, so I would not
think of them as in the same ball-park for NoSQL-like workloads. It's
not built for point queries or range scans, since any request would
launch a distributed job to scan all partitions. It's not something
built for, say,
After upgrading to 1.3.0, ALS.trainImplicit() has been returning vastly
smaller factors (and hence scores). For example, the first few product's
factor values in 1.2.0 are (0.04821, -0.00674, -0.0325). In 1.3.0, the
first few factor values are (2.535456E-8, 1.690301E-8, 6.99245E-8). This
Hello Michael,
Thanks for your time.
1. show tables from the Spark program returns nothing.
2. What entities are you talking about? (I am actually new to Hive as well)
On Thu, Mar 26, 2015 at 8:35 PM, Michael Armbrust mich...@databricks.com
wrote:
What does show tables return? You can also run
hi Nick
Unfortunately the Accumulo docs are woefully inadequate, and in some places,
flat wrong. I'm not sure if this is a case where the docs are 'flat wrong', or
if there's some wrinkle with spark-notebook in the mix that's messing everything
up. I've been working with some people on stack
It is logged from RecurringTimer#loop():
private def loop() {
try {
while (!stopped) {
clock.waitTillTime(nextTime)
callback(nextTime)
prevTime = nextTime
nextTime += period
logDebug("Callback for " + name + " called at time " + prevTime)
}
I would suggest looking for errors in the logs of your executors.
On Thu, Mar 26, 2015 at 3:20 AM, 李铖 lidali...@gmail.com wrote:
Again, when I run a Spark SQL query on a larger file, an error occurred. Has
anyone got a fix for it? Please help me.
Here is the track.
Stack Trace:
15/03/26 08:25:42 INFO ql.Driver: OK
15/03/26 08:25:42 INFO log.PerfLogger: PERFLOG method=releaseLocks
from=org.apache.hadoop.hive.ql.Driver
15/03/26 08:25:42 INFO log.PerfLogger: /PERFLOG method=releaseLocks
start=1427383542966 end=1427383542966 duration=0