Here are the Java API docs:
https://spark.apache.org/docs/latest/api/java/index.html
You can start with this example and convert it into Java (it's pretty
straightforward):
http://stackoverflow.com/questions/25484879/sql-over-spark-streaming
E.g., in Scala:
val sparkConf = new SparkConf().setMaster
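For reference, a minimal sketch of that pattern in Scala, assuming the
Spark 1.2-era SchemaRDD API and a socket text stream; PersonRecord and the
host/port are made up for illustration:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  case class PersonRecord(name: String, age: Int)  // hypothetical schema

  object StreamingSqlSketch {
    def main(args: Array[String]): Unit = {
      val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamingSql")
      val sc = new SparkContext(sparkConf)
      val ssc = new StreamingContext(sc, Seconds(10))
      val sqlContext = new SQLContext(sc)
      import sqlContext.createSchemaRDD

      // Lines arrive as "name,age"; run a SQL query on each micro-batch.
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.foreachRDD { rdd =>
        val people = rdd.map(_.split(",")).map(p => PersonRecord(p(0), p(1).trim.toInt))
        people.registerTempTable("people")
        sqlContext.sql("SELECT name FROM people WHERE age > 21").collect().foreach(println)
      }

      ssc.start()
      ssc.awaitTermination()
    }
  }

The Java version is structurally the same: build a JavaStreamingContext and,
inside foreachRDD, turn the JavaRDD into a SchemaRDD (e.g. via applySchema)
before querying it.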
Hi,
I have an existing batch processing system which uses SQL queries to extract
information from data. I want to replace it with a real-time system.
I am coding in Java, and to use SQL on streaming data I found a few examples,
but none of them is complete.
http://apache-spark-user-list.1001560.n3.nabb
FYI,
The latest Hive 0.14/Parquet will have column-renaming support.
Jianshi
On Wed, Dec 10, 2014 at 3:37 AM, Michael Armbrust
wrote:
> You might also try out the recently added support for views.
>
> On Mon, Dec 8, 2014 at 9:31 PM, Jianshi Huang
> wrote:
>
>> Ah... I see. Thanks for pointing it
I've eliminated the fetch-failed errors with these parameters (I don't know
which one actually fixed the problem), passed to spark-submit running 1.2.0:
--conf spark.shuffle.compress=false \
--conf spark.file.transferTo=false \
--conf spark.shuffle.manager=hash \
--conf spar
Hi dears!
I want to convert a SchemaRDD into an RDD of String. How can we do that?
Currently I am doing it like this, which is not converting correctly: no
exception, but the resulting strings are empty.
Here is my code:
def SchemaRDDToRDD( schemaRDD : SchemaRDD ) : RDD[ String ] = {
var types
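For what it's worth, a minimal sketch of one way to do the conversion,
assuming the Spark 1.2 SchemaRDD API (where a Row behaves like a Seq of
values) and a comma separator:

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.SchemaRDD

  def schemaRDDToRDD(schemaRDD: SchemaRDD): RDD[String] = {
    // Render each Row field by field; mkString joins the values with commas.
    schemaRDD.map(row => row.mkString(","))
  }

If the resulting strings come out empty, it is worth printing
schemaRDD.schema and a few rows first to confirm the rows actually carry
non-empty values.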
I am trying to use SQL over Spark Streaming using Java, but I am getting a
serialization exception.
public static void main(String args[]) {
SparkConf sparkConf = new SparkConf().setAppName("NumberCount");
JavaSparkContext jc = new JavaSparkContext(sparkConf);
JavaStreamingContext jssc =
Hi Josh,
If you want to cap the amount of memory per executor in coarse-grained mode,
then yes, you only get 240GB of memory as you mentioned. What's the reason
you don't want to raise the amount of memory you use per executor?
In coarse-grained mode the Spark executor is long-lived and it internal
See http://search-hadoop.com/m/JW1q5Cew0j
On Tue, Dec 23, 2014 at 8:00 PM, guxiaobo1982 wrote:
> Hi,
> The official pom.xml file only have profile for hadoop version 2.4 as the
> latest version, but I installed hadoop version 2.6.0 with ambari, how can I
> build spark against it, just using mvn
Hi,
The official pom.xml file only has a profile for Hadoop version 2.4 as the
latest version, but I installed Hadoop version 2.6.0 with Ambari. How can I
build Spark against it: just using mvn -Dhadoop.version=2.6.0, or by adding a
corresponding profile for it?
Regards,
Xiaobo
Hi,
On Fri, Dec 19, 2014 at 6:53 PM, Ashic Mahtab wrote:
>
> def doSomething(entry: SomeEntry, session: Session): SomeOutput = {
> val result = session.SomeOp(entry)
> SomeOutput(entry.Key, result.SomeProp)
> }
>
> I could use a transformation for rdd.map, but in case of failures, the map
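A common way to handle this is mapPartitions, creating (and closing) one
session per partition so that a retried task rebuilds its own session. A
minimal sketch, reusing SomeEntry/Session/SomeOutput from the quoted code;
createSession is a hypothetical factory and close() assumes the session is
closeable:

  val outputs = rdd.mapPartitions { entries =>
    val session = createSession()   // hypothetical: however a Session is obtained
    try {
      // Materialize the results before the session is closed.
      entries.map { entry =>
        val result = session.SomeOp(entry)
        SomeOutput(entry.Key, result.SomeProp)
      }.toList.iterator
    } finally {
      session.close()
    }
  }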
Thanks All.
Finally, the working code is below:
object PlayRecord {
def getUserActions(accounts: RDD[String], idType: Int, timeStart: Long,
timeStop: Long, cacheSize: Int,
filterSongDays: Int, filterPlaylistDays: Int):
RDD[(String, (Int, Set[Long], Set[Long]))] = {
acc
It's working now. Probably I didn't specify the excluded list correctly. I
kept revising it and now it's working. Thanks.
Ey-Chih Chow
Hi Lam, I can confirm this is a bug with the latest master, and I filed a JIRA
issue for it:
https://issues.apache.org/jira/browse/SPARK-4944
Hope to come up with a solution soon.
Cheng Hao
From: Jerry Lam [mailto:chiling...@gmail.com]
Sent: Wednesday, December 24, 2014 4:26 AM
To: user@spark.apac
I am using Eclipse to develop a Spark application (using Spark 1.1.0). I use
the ScalaTest framework to test the application. But I was blocked by
the following exception:
java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s
signer information does not match signer information of other classes in the
same package
Could you please provide a complete stacktrace? Also it would be good if
you can share your hive-site.xml as well.
On 12/23/14 4:42 PM, Dai, Kevin wrote:
Hi, there
When I use the Hive UDF from_unixtime with the HiveContext, the job blocks
and the log is as follows:
sun.misc.Unsafe.park(Native Me
You should be able to kill the job using the webUI or via spark-class.
More info can be found in the thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-kill-a-Spark-job-running-in-cluster-mode-td18583.html.
HTH!
On Tue, Dec 23, 2014 at 4:47 PM, durga wrote:
> Hi All ,
>
> It se
Hi Phil,
This sounds a lot like a deadlock in Hadoop's Configuration object that I
ran into a while back. If you jstack the JVM and see a thread that looks
like the below, it could be https://issues.apache.org/jira/browse/SPARK-2546
"Executor task launch worker-6" daemon prio=10 tid=0x7f91f0
Here is a more cleaned-up version; it can be used in ./sbt/sbt
hive/console to easily reproduce this issue:
sql("SELECT * FROM src WHERE key % 2 = 0").
  sample(withReplacement = false, fraction = 0.05).
registerTempTable("sampled")
println(table("sampled").queryExecution)
val query = sql("
Hi All ,
It seems the problem is a little more complicated.
The job is hung up reading an S3 file; even if I kill the Unix process that
started the job, the Spark job is not killed. It is still hung up there.
Now the questions are:
How do I find the Spark job based on its name?
How do I kill the spark-
Hi All,
We are getting an exception after adding one RDD to another RDD.
We first declared an empty RDD "A", then received a new DStream "B" from
Kafka; for each RDD in the DStream "B", we kept adding it to the existing
RDD "A". The error happened when we were trying to use the updated RDD "A".
Could
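For context, a minimal sketch of the pattern described above (kafkaStream and
the checkpoint directory are placeholders, assuming a DStream[String]).
Repeatedly unioning into a single RDD grows its lineage with every batch,
which is a common source of trouble with this pattern, so it usually needs a
periodic cache/checkpoint:

  import org.apache.spark.rdd.RDD

  sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // needed for RDD.checkpoint
  var accumulated: RDD[String] = sc.parallelize(Seq.empty[String])   // RDD "A"

  kafkaStream.foreachRDD { rdd =>                  // each RDD of DStream "B"
    accumulated = accumulated.union(rdd).cache()
    accumulated.checkpoint()                       // truncate the growing lineage
  }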
Hi there
We are on MLlib 1.1.1 and trying different regularization parameters. We
noticed that regParam doesn't affect the weights at all. Is setting the
reg param via the optimizer the right thing to do? Do we need to set our
own updater? Is anyone else seeing the same behaviour?
thanks again
tho
I am trying to load a Parquet file which has a comma in its name. Yes, this
is a valid file name in HDFS. However, sqlContext.parquetFile interprets
this as a comma-separated list of parquet files.
Is there any way to escape the comma so it is treated as part of a single
file name?
--
Daniel Sie
Turns out I was using the s3:// prefix (in a standalone Spark cluster). It
was writing a LOT of block_* files to my S3 bucket, which was the cause for
the slowness. I was coming from Amazon EMR, where Amazon's underlying FS
implementation has re-mapped s3:// to s3n://, which doesn't use the block
I am wondering if there is a way to update an RDD that is used in a transform
operation of a DStream.
To use the example from the spark streaming programming guide, let’s say I want
to update my spamInfoRDD once an hour without having to restart the streaming
app.
If an RDD used in Transform op
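One possible approach (just a sketch; loadSpamInfo and the key types are
hypothetical) is to keep the lookup RDD in a driver-side variable and swap it
on a schedule. Since transform re-evaluates its function on the driver every
batch, each batch picks up the latest reference:

  import java.util.concurrent.{Executors, TimeUnit}
  import org.apache.spark.rdd.RDD

  @volatile var spamInfoRDD: RDD[(String, Double)] = loadSpamInfo(sc).cache()

  // Reload the lookup data once an hour without restarting the streaming app.
  val refresher = Executors.newSingleThreadScheduledExecutor()
  refresher.scheduleAtFixedRate(new Runnable {
    def run(): Unit = { spamInfoRDD = loadSpamInfo(sc).cache() }
  }, 1, 1, TimeUnit.HOURS)

  // Assuming a DStream of (key, value) pairs keyed the same way as spamInfoRDD.
  val cleaned = pairStream.transform { rdd => rdd.join(spamInfoRDD) }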
http://www.jets3t.org/toolkit/configuration.html
Put the following properties in a file named jets3t.properties and make
sure it is available while your Spark job is running (just place it in
~/ and pass a reference to it when calling spark-submit with --files
~/jets3t.properties):
httpclien
Hi spark users,
I'm trying to create an external table using HiveContext after creating a
SchemaRDD and saving the RDD into a Parquet file on HDFS.
I would like to use the schema in the SchemaRDD (rdd_table) when I create
the external table.
For example:
rdd_table.saveAsParquetFile("/user/spark/my_
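A minimal sketch of one way to wire this up, assuming the Hive 0.13 that
Spark 1.2 bundles (where STORED AS PARQUET is supported); the path, table
name and columns below are made up, and the column list has to mirror what
rdd_table.schema / printSchema() reports:

  // 1. Write the data out as Parquet.
  rdd_table.saveAsParquetFile("/user/spark/my_table_parquet")

  // 2. Point an external Hive table at that directory.
  hiveContext.sql(
    """CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id BIGINT, name STRING)
      |STORED AS PARQUET
      |LOCATION '/user/spark/my_table_parquet'""".stripMargin)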
Have a look at RDD.groupBy(...) and reduceByKey(...)
On Tue, Dec 23, 2014 at 4:47 AM, sachin Singh
wrote:
> Hi,
> I have a csv file having fields as a,b,c .
> I want to do aggregation(sum,average..) based on any field(a,b or c) as per
> user input,
> using Apache Spark Java API,Please Help Urgen
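Following that suggestion, a minimal sketch in Scala (the Java API uses
mapToPair/reduceByKey in the same way); the file path is made up and the
field indices are whatever the user chooses:

  val lines = sc.textFile("hdfs:///data/input.csv")
  val keyIndex = 0     // field to group by (a=0, b=1, c=2), from user input
  val valueIndex = 2   // field to aggregate

  val sumAndCount = lines
    .map(_.split(","))
    .map(fields => (fields(keyIndex), (fields(valueIndex).toDouble, 1L)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

  // sum and average per key
  val sumAndAvg = sumAndCount.mapValues { case (sum, count) => (sum, sum / count) }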
I've had a lot of difficulties with using the s3:// prefix; s3n:// seems
to work much better. Can't find the link ATM, but I seem to recall that
s3:// (Hadoop's original block format for S3) is no longer recommended for
use. Amazon's EMR goes so far as to remap s3:// to s3n:// behind the
scene
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) but I
am still seeing FetchFailedException.
On Mon, Dec 22, 2014 at 8:27 AM, steghe wrote:
> Which version of spark are you running?
>
> It could be related to this
> https://issues.apache.org/jira/browse/SPARK-3633
>
> fixed in
Silly question: what is the best way to shuffle protobuf messages in a Spark
(Streaming) job? Can I use Kryo on top of the protobuf Message type?
--
Chen Song
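One common approach is to register a protobuf-aware Kryo serializer. A
sketch, assuming the twitter chill-protobuf artifact is on the classpath and
MyProtoMessage is a placeholder for your generated protobuf class:

  import com.esotericsoftware.kryo.Kryo
  import com.twitter.chill.protobuf.ProtobufSerializer
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoRegistrator

  class ProtoRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      // Serialize protobuf messages via their own toByteArray/parseFrom.
      kryo.register(classOf[MyProtoMessage], new ProtobufSerializer())
    }
  }

  val conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", classOf[ProtoRegistrator].getName)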
Yes, my change is slightly downstream of this point in the processing
though. The code is still creating a counter for each distinct score
value, and then binning. I don't think that would cause a failure -
just might be slow. At the extremes, you might see 'fetch failure' as
a symptom of things ru
Xiangrui: Hi, I want to use this with streaming and with batch jobs too. I am
using Kafka (streaming) and Elasticsearch (batch) as sources and want to
calculate a sentiment value from the input text.
Simon: great, do you have any docs on how I can embed it into my application
without using the HTTP interface? (how can I
Hey Xiangrui,
Is there any plan to have a streaming compatible ALS version?
Or if it's currently doable, is there any example?
On Tue, Dec 23, 2014 at 4:31 PM, Xiangrui Meng wrote:
> We have streaming linear regression (since v1.1) and k-means (v1.2) in
> MLlib. You can check the user gu
Hi all,
I'm trying to lock down ALL Spark ports and have tried using
spark-defaults.conf and the SparkContext. (The example below was run in
local[*] mode, but all attempts to run in local mode or via spark-submit.sh
on the cluster with a jar produce the same result.)
My goal is to define all commu
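For reference, a sketch of the kind of settings involved, using property
names from the Spark 1.x configuration docs (the port numbers are arbitrary
examples); the same keys can also go into spark-defaults.conf:

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.driver.port",       "7001")
    .set("spark.blockManager.port", "7005")
    .set("spark.ui.port",           "4040")

Executors additionally use spark.executor.port, and in 1.x there are also
spark.fileserver.port and spark.broadcast.port, which can be pinned the same
way.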
Yep Michael Quinlan, it's working as suggested by Hao Ren.
Thanks to you and Hao Ren.
Sean's PR may be relevant to this issue
(https://github.com/apache/spark/pull/3702). As a workaround, you can
try truncating the raw scores to 4 digits (e.g., 0.5643215 -> 0.5643)
before sending them to BinaryClassificationMetrics. This may not work
well if the score distribution is very skewed. See
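A minimal sketch of that workaround (scoreAndLabels is assumed to be the
usual RDD[(Double, Double)] of (score, label) pairs):

  import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

  // Round each raw score to 4 decimal places to cap the number of distinct
  // thresholds the metrics computation has to track.
  val rounded = scoreAndLabels.map { case (score, label) =>
    (math.rint(score * 10000) / 10000.0, label)
  }
  val metrics = new BinaryClassificationMetrics(rounded)
  println(metrics.areaUnderROC())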
We have streaming linear regression (since v1.1) and k-means (v1.2) in
MLlib. You can check the user guide:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression
http://spark.apache.org/docs/latest/mllib-clustering.html#streaming-clustering
-Xiangrui
On Tue, D
Hafiz,
You can probably use the RDD.mapPartitionsWithIndex method.
Mike
On Tue, Dec 23, 2014 at 8:35 AM, Hafiz Mujadid [via Apache Spark User List]
wrote:
>
> hi dears!
>
> Is there some efficient way to drop first line of an RDD[String]?
>
> any suggestion?
>
> Thanks
>
> -
Hi Gerard,
SPARK-4286 is the ticket I am working on; besides supporting the shuffle
service, it also supports the executor scaling callbacks (kill/request
total) for coarse-grained mode.
I created SPARK-4940 to discuss the distribution problem in more detail;
let's bring our discussions there.
Ti
Hi,
I have recently seen a demo of Spark where different pieces were put
together (training via MLlib + deploying on Spark Streaming).
I was wondering whether MLlib can currently train directly on streaming data,
and, if so, what the semantics of the algorithms are.
If not, would it be interesting to h
I've been attempting to run a job based on MLlib's ALS implementation for a
while now and have hit an issue I'm having a lot of difficulty getting to
the bottom of.
On a moderate size set of input data it works fine, but against larger
(still well short of what I'd think of as big) sets of data, I
There is also a lazy implementation:
http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/
I generated a PR for it -- there was also an alternate proposal for having it
be a library in the new Spark Packages site:
http://databricks.com/bl
that's nice if it works
On YARN, Spark does not manage the cluster; YARN does. Usually the
cluster manager UI is under http://<resourcemanager-host>:9026/cluster. I
believe that it chooses the port for the Spark driver UI randomly, but an
easy way of accessing it is by clicking on the "Application Master" link
under the "Tracking UI" column
Hi there,
We are using MLlib 1.1.1, and doing logistic regression with a dataset of
about 150M rows.
The training part usually goes pretty smoothly without any retries. But
during the prediction stage and the BinaryClassificationMetrics stage, I am
seeing retries with "fetch failure" errors.
The p
Hi,
maybe the drop function is helpful for you (even though it is probably
more than you need, it's still an interesting read):
http://erikerlandson.github.io/blog/2014/07/27/some-implications-of-supporting-the-scala-drop-method-for-spark-rdds/
Joerg
On Tue, Dec 23, 2014 at 5:45 PM, Hao Ren wrote:
> H
Hi,
I guess you would like to remove the header of a CSV file.
You can play with partitions. =)
// src is your RDD
val noHeader = src.mapPartitionsWithIndex(
  (i, iterator) =>
    if (i == 0 && iterator.hasNext) {
      iterator.next
      iterator
    } else iterator)
Thus, you don't need to
One observation is that:
if the fraction is big, say 50% - 80%, sampling is good; everything runs as
expected.
But if the fraction is small, for example 5%, the sampled data contains wrong
rows which should have been filtered out.
The workaround is materializing t1 first:
t1.cache
t1.count
These operations make
hi dears!
Is there an efficient way to drop the first line of an RDD[String]?
Any suggestions?
Thanks
Update:
t1 is good. After collecting on t1, I find that all rows are ok (is_new = 0).
Just after sampling, there are some rows where is_new = 1, which should have
been filtered out by the WHERE clause.
Hello folks,
I'm trying to deploy a Spark driver on Amazon EMR in yarn-cluster mode,
expecting to be able to access the Spark UI at the <driver-host>:4040
address (default port). The problem here is that the Spark UI port is
always defined randomly at runtime, although I also tried to specify it in
the spark-
Doh...figured it out.
Thanks for the clarification Sean.
Best Regards,Guru Medasani
> From: so...@cloudera.com
> Date: Tue, 23 Dec 2014 15:39:59 +
> Subject: Re: Spark Installation Maven PermGen OutOfMemoryException
> To: gdm...@outlook.com
> CC: protsenk...@gmail.com; user@spark.apache.org
>
> The text there
The text there is actually unclear. In Java 8, you still need to set
the max heap size (-Xmx2g). The optional bit is the
"-XX:MaxPermSize=512M" actually. Java 8 no longer has a separate
permanent generation.
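For reference, the setting suggested by the Spark build docs (on Java 7; per
the note above, the MaxPermSize part can be dropped on Java 8):

  export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"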
On Tue, Dec 23, 2014 at 3:32 PM, Guru Medasani wrote:
> Hi Vladimir,
>
> From the link Se
Hi Vladimir,
From the link Sean posted, if you use Java 8 there is the following note:
"Note: For Java 8 and above this step is not required."
So if you have no problems using Java 8, give it a shot.
Best Regards,Guru Medasani
> From: so...@cloudera.com
> Date: Tue, 23 Dec 2014 15:04:42 +000
This Parquet bug only triggers when there exist some row groups which
are either empty or contain only null binary values.
So it’s still safe to turn it on if the data types of all columns are
boolean, numeric, or non-null binaries.
You may turn it on with SET spark.sql.parquet.filterPushdown=true
This depends on which output format you want. For Parquet, you can
simply do this:
hiveContext.table("some_db.some_table").saveAsParquetFile("hdfs://path/to/file")
On 12/23/14 5:22 PM, LinQili wrote:
Hi Leo:
Thanks for your reply.
I am talking about using hive from spark to export data fro
I think you should use a minimum of 2 GB of memory for building it with Maven.
-Somnath
-Original Message-
From: Vladimir Protsenko [mailto:protsenk...@gmail.com]
Sent: Tuesday, December 23, 2014 8:28 PM
To: user@spark.apache.org
Subject: Spark Installation Maven PermGen OutOfMemoryExceptio
You might try a little more. The official guidance suggests 2GB:
https://spark.apache.org/docs/latest/building-spark.html#setting-up-mavens-memory-usage
On Tue, Dec 23, 2014 at 2:57 PM, Vladimir Protsenko
wrote:
> I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded code from github,
> in
I am installing Spark 1.2.0 on CentOS 6.6. Just downloaded code from github,
installed maven and trying to compile system:
git clone https://github.com/apache/spark.git
git checkout v1.2.0
mvn -DskipTests clean package
leads to OutOfMemoryException. What amount of memory does it require?
expor
I filed a new issue HADOOP-11444. According to HADOOP-10372, s3 is likely
to be deprecated anyway in favor of s3n.
Also the comment section notes that Amazon has implemented an EmrFileSystem
for S3 which is built using AWS SDK rather than JetS3t.
On Tue, Dec 23, 2014 at 2:06 PM, Enno Shioji
Hey Jay :)
I tried "s3n" which uses the Jets3tNativeFileSystemStore, and the double
slash went away.
As far as I can see, it does look like a bug in hadoop-common; I'll file a
ticket for it.
Hope you are doing well, by the way!
PS:
Jets3tNativeFileSystemStore's implementation of pathToKey is:
=
Hi Enno. It might be worthwhile to cross-post this on dev@hadoop... Obviously
a simple Spark way to test this would be to change the URI to write to hdfs://
or maybe file://, and confirm that the extra slash goes away.
- if it's indeed a JetS3t issue we should add a new unit test for
Is anybody experiencing this? It looks like a bug in JetS3t to me, but
thought I'd sanity check before filing an issue.
I'm writing to S3 using ReceiverInputDStream#saveAsTextFiles with a S3 URL
("s3://fake-test/1234").
The code does write to S3, but with double forward slashes
Right. I contacted the SummingBird users as well. It doesn't support Spark
streaming currently.
We are heading towards Storm as it is the most widely used. Is Spark
Streaming production-ready?
Thanks
Ajay
On Tue, Dec 23, 2014 at 3:47 PM, Gerard Maas wrote:
> I'm not aware of a project trying to
I'm not aware of a project trying to achieve this integration. At some
point Summingbird had the intention of adding a Spark port, and that could
potentially bridge Storm and Spark. Not sure if that evolved into something
concrete.
In any case, an attempt to bring Storm and Spark together will co
After checking the spark code, I now realize that an rdd that was cached to
disk can't be evicted, so I will just persist the rdd to disk after the
random numbers are created.
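For what it's worth, a minimal sketch of that (randomRDD is a placeholder for
the RDD holding the generated random numbers):

  import org.apache.spark.storage.StorageLevel

  randomRDD.persist(StorageLevel.DISK_ONLY)
  randomRDD.count()  // force materialization so later actions reuse the same values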
Hi,
I have a CSV file having fields a, b, c.
I want to do aggregation (sum, average, ...) based on any field (a, b or c)
as per user input,
using the Apache Spark Java API. Please help, urgent!
Thanks in advance,
Regards
Sachin
Hi Leo: Thanks for your reply. I am talking about using Hive from Spark to
export data from Hive to HDFS, maybe like:
val exportData = s"insert overwrite directory '/user/linqili/tmp/src' select * from $DB.$tableName"
hiveContext.sql(exportData)
but it is unsupported in Spark now: Exceptio
(In your libsvm example, your indices are not ascending.)
The first weight corresponds to the first feature, of course. An
indexing scheme doesn't change that or somehow make the first feature
map to the second (where would the last one go then?). You'll find the
first weight at offset 0 in an arr
Hi,
The question is about doing streaming in Spark together with Storm (not using
Spark Streaming).
The idea is to use Spark as an in-memory computation engine, with static data
coming from Cassandra/HBase and streaming data from Storm.
Thanks
Ajay
On Tue, Dec 23, 2014 at 2:03 PM, Gerard Maas wrote:
> Hi,
>
>
Hi there,
When I use the Hive UDF from_unixtime with the HiveContext, the job blocks
and the log is as follows:
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueu
Hi,
I'm not sure what you are asking:
Whether we can use spouts and bolts in Spark (=> no)
or whether we can do streaming in Spark:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
-kr, Gerard.
On Tue, Dec 23, 2014 at 9:03 AM, Ajay wrote:
> Hi,
>
> Can we use Storm Stre
Hi all: I wonder if there is a way to export data from a Hive table into HDFS
using Spark? Like this:
INSERT OVERWRITE DIRECTORY '/user/linqili/tmp/src' select * from $DB.$tableName
That's because somewhere in your code you have specified localhost
instead of the IP address of the machine running the service. When you run it
in local mode it will work fine, because everything happens on that machine
and hence it will be able to connect to localhost which runs the service; now o
Hi,
Can we use Storm streaming as an RDD in Spark? Or is there any way to get
Spark to work with Storm?
Thanks
Ajay
Hi,
I am having the same problem.
Any solution to that?
Thanks,
Shai