If you want to submit applications to a remote cluster where your port 7077
is opened publicly, then you would need to set *spark.driver.host* (with
the public IP of your laptop) and *spark.driver.port* (optional, if there's
no firewall between your laptop and the remote cluster). Keeping
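A minimal sketch of that configuration, assuming illustrative values for the
master URL, public IP, and port (only the config key names are real):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("spark://remote-master:7077")   // the publicly opened master port
  .set("spark.driver.host", "203.0.113.7")   // public IP of your laptop
  .set("spark.driver.port", "51000")         // optional: pin only if a firewall needs a fixed port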
Hi,
I am using Spark 1.3.1. I tried to insert (a new partition) into an
existing partitioned Hive table, but got an ArrayIndexOutOfBoundsException.
Below is a code snippet and the debug log. Any suggestions, please.
case class Record4Dim(key: String,
Hi,
As per the doc https://github.com/databricks/spark-avro/blob/master/README.md,
the union type doesn't support all kinds of combinations.
Is there any plan to support a union type containing string and long in the
near future?
Thanks
Shagun Agarwal
Hi,
We are using Spark SQL (1.3.1) to load data from Microsoft SQL Server using
JDBC (as described in
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
).
It is working fine except when there is a space in column names (we can't
modify the schemas to remove
You could try rdd.persist(MEMORY_AND_DISK/DISK_ONLY).flatMap(...). I think
StorageLevel MEMORY_AND_DISK means Spark will try to keep the data in
memory, and if there isn't sufficient space then it will be spilled to
disk.
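A minimal sketch of that suggestion, where expand stands in for whatever your
flatMap body does:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def spillFriendly[A, B: ClassTag](rdd: RDD[A],
    expand: A => TraversableOnce[B]): RDD[B] =
  rdd.persist(StorageLevel.MEMORY_AND_DISK).flatMap(expand)  // spills to disk when memory is short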
Thanks
Best Regards
On Mon, Jun 1, 2015 at 11:02 PM, octavian.ganea
It says your namenode is down (connection refused on 8020). You can restart
your HDFS by going into the hadoop directory and running sbin/stop-dfs.sh and
then sbin/start-dfs.sh
Thanks
Best Regards
On Tue, Jun 2, 2015 at 5:03 AM, Su She suhsheka...@gmail.com wrote:
Hello All,
A bit scared I did
Hi, the code is some hundreds of lines of Python. I can try to compose a
minimal example as soon as I find the time, though. Any ideas until
then?
Would you mind posting the code?
On 2 Jun 2015 00:53, Karlson ksonsp...@siberie.de wrote:
Hi,
In all (PySpark) Spark jobs that become somewhat
I found an interesting presentation,
http://www.slideshare.net/colorant/spark-shuffle-introduction, and you can
also go through this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/How-does-shuffle-work-in-spark-td584.html
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:06 PM, ÐΞ€ρ@Ҝ (๏̯͡๏)
You can try to skip the tests; try with:
mvn -Dhadoop.version=2.4.0 -Pyarn -DskipTests clean package
Thanks
Best Regards
On Tue, Jun 2, 2015 at 2:51 AM, Stephen Boesch java...@gmail.com wrote:
I downloaded the 1.3.1 distro tarball
$ ll ../spark-1.3.1.tar.gz
-rw-r--r--@ 1 steve staff
I tried using reduceByKey, without success.
I also tried this: rdd.persist(MEMORY_AND_DISK).flatMap(...).reduceByKey.
However, I got the same error as before, namely the error described here:
Erik,
Thank you for your answer. It seems really good, but unfortunately I'm
not very familiar with Scala, so I have only partly understood it.
Could you please explain your idea with a Spark implementation?
Best regards,
Marko
On Mon 01 Jun 2015 06:35:17 PM CEST, Erik Erlandson wrote:
I haven't
Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that
worked fine for Spark 1.3. The job starts on the cluster (yarn-cluster
mode), initial stage starts, but the job fails before any task succeeds
with the following error. Any hints?
[ERROR] [06/02/2015 09:05:36.962] [Executor
Is it input and output bytes/record size?
--
Deepak
Hi Dimple,
take a look at the existing transformers:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
Are you sure it's memory related? What is the disk utilization and IO
performance on the workers? The error you posted looks to be related to
shuffle trying to obtain block data from another worker node and failing to
do so in a reasonable amount of time. It may still be memory related, but I'm
not
If you're using the Spark partition id directly as the key, then you don't
need to access offset ranges at all, right?
You can create a single instance of a partitioner in advance, and all it
needs to know is the number of partitions (which is just the count of all
the Kafka topic/partitions).
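A minimal sketch of that idea; the class name is illustrative, and it assumes
the key already is the partition id:

import org.apache.spark.Partitioner

class IdentityPartitioner(override val numPartitions: Int) extends Partitioner {
  // Built once up front from the total topic/partition count; the key is
  // assumed to be the partition id set upstream in the map().
  override def getPartition(key: Any): Int = key.asInstanceOf[Int]
}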
On
Does it happen every time you read a parquet source?
On Tue, Jun 2, 2015 at 3:42 AM, Anders Arpteg arp...@spotify.com wrote:
The log is from the log aggregation tool (Hortonworks, yarn logs ...),
so both executors and driver. I'll send a private mail to you with the full
logs. Also, tried
It did hang for me too: high RAM consumption during the build. I had to free
a lot of RAM and add swap just to get it built on my third attempt.
Everything else looks fine. You can download the prebuilt versions from the
Spark homepage to save yourself from all this trouble.
Thanks,
Ritesh
Ahh, this did the trick. I had to get the name node out of safe mode first,
however, before it fully worked.
Thanks!
On Tue, Jun 2, 2015 at 12:09 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
It says your namenode is down (connection refused on 8020), you can restart
your HDFS by going into hadoop
So in Spark, after acquiring executors from the cluster manager, are tasks
scheduled on executors based on data locality? I mean, if in an
application there are 2 jobs and the output of job 1 is used as the input of
another job,
and in job 1 I persisted some RDD, then while running job 2 will it use
I downloaded the latest Spark (1.3.) from GitHub. Then I tried to build it.
First for Scala 2.10 (and Hadoop 2.4):
build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
That resulted in a hangup after printing a bunch of lines like
[INFO] Dependency-reduced POM written at
For the impatient R-user, here is a link
http://people.apache.org/~pwendell/spark-nightly/spark-1.4-docs/latest/sparkr.html
to get started working with DataFrames using SparkR.
Or copy and paste this link into your web browser:
Thanks, this works.
Hopefully I didn't miss something important with this approach.
On Tue, 2 Jun 2015 at 20:15, Cody Koeninger c...@koeninger.org:
If you're using the Spark partition id directly as the key, then you don't
need to access offset ranges at all, right?
You can create a single
Thanks Peter. Can you share the Tokenizer.java class for Spark 1.2.1?
Dimple
On Tue, Jun 2, 2015 at 10:51 AM, Peter Rudenko petro.rude...@gmail.com
wrote:
Hi Dimple,
take a look to existing transformers:
I am having the same problem reading JSON. There does not seem to be a way
of selecting a field that has a space, such as Executor Info from the Spark
logs.
I suggest that we open a JIRA ticket to address this issue.
On Jun 2, 2015 10:08 AM, ayan guha guha.a...@gmail.com wrote:
I would think the
Hi everyone,
I wanted to drop a note about a newly organized developer meetup in Bay
Area: the Big Data Application Meetup (http://meetup.com/bigdataapps) and
call for speakers. The plan is for meetup topics to be focused on
application use-cases: how developers can build end-to-end solutions
RStudio is the best IDE for running SparkR.
Instructions for this can be found at this link:
https://github.com/apache/spark/tree/branch-1.4/R. You will need to set
some environment variables as described below.
*Using SparkR from RStudio*
If you wish to use SparkR from RStudio or other R
Thanks for the quick reply Ram. Will take a look at the Tokenizer code and
try it out.
Dimple
On Tue, Jun 2, 2015 at 10:42 AM, Ram Sriharsha sriharsha@gmail.com
wrote:
Hi
We are in the process of adding examples for feature transformations (
I have a situation where I have multiple tests that use Spark streaming with
Manual clock. The first run is OK and processes the data when I increment
the clock to the sliding window duration. The second test deviates and
doesn't process any data. The trace follows, which indicates the memory store is
You shouldn't have to persist the RDD at all; just call flatMap and reduce on
it directly. If you try to persist it, that will try to load the original data
into memory, but here you are only scanning through it once.
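A minimal sketch of that advice, with expand and combine standing in for the
original flatMap body and reduce function:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def singlePass[A, K: ClassTag, V: ClassTag](rdd: RDD[A],
    expand: A => Seq[(K, V)], combine: (V, V) => V): RDD[(K, V)] =
  rdd.flatMap(expand).reduceByKey(combine)  // no persist: one scan over the data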
Matei
On Jun 2, 2015, at 2:09 AM, Octavian Ganea octavian.ga...@inf.ethz.ch
Nice to hear from you, Holden! I ended up trying exactly that (Column),
but I may have done it wrong:
In [5]: g.agg(Column("percentile(value, 0.5)"))
Py4JError: An error occurred while calling o97.agg. Trace:
py4j.Py4JException: Method agg([class java.lang.String, class
Have you run zinc during the build?
See build/mvn, which installs zinc.
Cheers
On Tue, Jun 2, 2015 at 12:26 PM, Ritesh Kumar Singh
riteshoneinamill...@gmail.com wrote:
It did hang for me too: high RAM consumption during the build. I had to free
a lot of RAM and add swap just to get it
So for Column you need to pass in a Java function. I have some sample code
which does this, but it does terrible things to access Spark internals.
On Tuesday, June 2, 2015, Olivier Girardot o.girar...@lateral-thoughts.com
wrote:
Nice to hear from you Holden ! I ended up trying exactly that
Thanks for the answer, I'm currently doing exactly that.
I'll try to sum up the usual Pandas-to-Spark DataFrame caveats soon.
Regards,
Olivier.
On Tue, 2 Jun 2015 at 02:38, Davies Liu dav...@databricks.com wrote:
The second one sounds reasonable, I think.
On Thu, Apr 30, 2015 at 1:42 AM,
Ryan - I sent a PR to fix your issue:
https://github.com/apache/spark/pull/6599
Edward - I have no idea why the following error happened. ContextCleaner
doesn't use any Hadoop API. Could you try Spark 1.3.0? It's supposed to
support both Hadoop 1 and Hadoop 2.
Exception in thread Spark Context
Thanks so much Shixiong! This is great.
On Tue, Jun 2, 2015 at 8:26 PM Shixiong Zhu zsxw...@gmail.com wrote:
Ryan - I sent a PR to fix your issue:
https://github.com/apache/spark/pull/6599
Edward - I have no idea why the following error happened. ContextCleaner
doesn't use any Hadoop API.
Just testing with Spark 1.3, it looks like it sets the proxy correctly to
be the YARN RM host (0101)
15/06/03 10:34:19 INFO yarn.ApplicationMaster: Registered signal handlers
for [TERM, HUP, INT]
15/06/03 10:34:20 INFO yarn.ApplicationMaster: ApplicationAttemptId:
That code hasn't changed at all between 1.3 and 1.4; it also has been
working fine for me.
Are you sure you're using exactly the same Hadoop libraries (since you're
building with -Phadoop-provided) and Hadoop configuration in both cases?
On Tue, Jun 2, 2015 at 5:29 PM, Night Wolf
We are having a separate discussion about this, but I don't understand why
this is a problem. You're supposed to build Spark for Hadoop 1 if you run
it on Hadoop 1, and I am not sure that is happening here, given the error. I
do not think this should change, as I do not see that it fixes anything.
Spark 1.3.1, Scala 2.11.6, Maven 3.3.3; I'm behind a proxy and have set my
proxy settings in the Maven settings.
Thanks,
On Tue, Jun 2, 2015 at 2:54 PM, Ted Yu yuzhih...@gmail.com wrote:
Can you give us some more information?
Such as:
which Spark release you were building
what command you used
I found this:
https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/ml/feature/Tokenizer.html
which indicates the Tokenizer did exist in Spark 1.2.0 but not in
1.2.1?
On Tue, Jun 2, 2015 at 12:45 PM, Peter Rudenko petro.rude...@gmail.com
wrote:
I'm afraid there's no such class
Can you give us some more information?
Such as:
which Spark release you were building
what command you used
Scala version you used
Thanks
On Tue, Jun 2, 2015 at 2:50 PM, Mulugeta Mammo mulugeta.abe...@gmail.com
wrote:
Building Spark is throwing errors, any ideas?
[FATAL] Non-resolvable
Hi all,
Has anyone tried to add scripting capabilities to Spark Streaming using Groovy?
I would like to stop the streaming context, update a transformation function
written in Groovy (for example, to manipulate JSON), restart the streaming
context, and obtain a new behavior without re-submitting the
Yup, exactly.
All the workers will use local disk in addition to RAM, but maybe one thing you
need to configure is the directory to use for that. It should be set through
spark.local.dir. By default it's /tmp, which on some machines is also in RAM,
so that could be a problem. You should set it
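A minimal sketch of that setting; the directory is illustrative, so point it
at a real disk:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.local.dir", "/mnt/spark-scratch")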
Building Spark is throwing errors, any ideas?
[FATAL] Non-resolvable parent POM: Could not transfer artifact
org.apache:apache:pom:14 from/to central
(http://repo.maven.apache.org/maven2): Error transferring file:
repo.maven.apache.org from
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4
http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf
--
Ruslan Dautkhanov
On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
Thank you, Tathagata,
I ran dev/change-version-to-2.11.sh first.
I used the following command but didn't reproduce the error below:
mvn -DskipTests -Phadoop-2.4 -Pyarn -Phive clean package
My env: Maven 3.3.1
Possibly the error was related to proxy setting.
FYI
On Tue, Jun 2, 2015 at 3:14 PM, Mulugeta Mammo
Hi,
Could someone explain the behavior of the
spark.streaming.kafka.maxRatePerPartition parameter? The doc says "An
important (configuration) is spark.streaming.kafka.maxRatePerPartition, which
is the maximum rate at which each Kafka partition will be read by (the)
direct API."
What is the default
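For reference, a sketch of how the parameter is set; the value is
illustrative, and as far as I recall, when it is unset the direct API applies
no rate cap at all:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")  // max records/sec per Kafka partition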
Hi,
Thanks for the information. I'll give Spark 1.4 a try when it's released.
On Wed, Jun 3, 2015 at 11:31 AM, Tathagata Das t...@databricks.com wrote:
Could you try it out with Spark 1.4 RC3?
Also pinging, Cloudera folks, they may be aware of something.
BTW, the way I have debugged memory
I run a Spark application in a Spark standalone cluster with client deploy
mode.
I want to check out the logs of my finished application, but I always get a
page telling me "Application history not found - Application xxx is still in
process."
I am pretty sure that the application has indeed completed
Could you try it out with Spark 1.4 RC3?
Also pinging, Cloudera folks, they may be aware of something.
BTW, the way I have debugged memory leaks in the past is as follows.
Run with a small driver memory, say 1 GB. Periodically (maybe with a script),
take snapshots of the histogram and also do memory
Hi,
Thanks for your reply. Here are the top 30 entries of the jmap -histo:live result:
 num     #instances          #bytes  class name
----------------------------------------------
   1:         40802       145083848  [B
   2:         99264        12716112  methodKlass
   3:         99264        12291480
I've been encountering something similar too. I suspected it was related
to the lineage growth of the graph/RDDs, so I checkpoint the graph every 60
Pregel rounds, after which my program doesn't slow down any more
(except that every checkpoint takes some extra time).
Thanks Marcelo - looks like it was my fault. It seems when we deployed the new
version of Spark it was picking up the wrong yarn site and setting the
wrong proxy host. All good now!
On Wed, Jun 3, 2015 at 11:01 AM, Marcelo Vanzin van...@cloudera.com wrote:
That code hasn't changed at all between
I want to do this:
val qtSessionsWithQt = rawQtSession.filter(_._2.qualifiedTreatmentId !=
NULL_VALUE)
val guidUidMapSessions = rawQtSession.filter(_._2.qualifiedTreatmentId
== NULL_VALUE)
This will run two different stages; can this be done in one stage?
val (qtSessionsWithQt,
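One hedged sketch of a workaround, reusing the names above: compute the
predicate once, cache the tagged RDD, and derive both sides from it, so the
source is scanned a single time:

val tagged = rawQtSession
  .map { case (k, v) => (v.qualifiedTreatmentId != NULL_VALUE, (k, v)) }
  .persist()                                         // avoid recomputing the source twice
val qtSessionsWithQt   = tagged.filter(x => x._1).values
val guidUidMapSessions = tagged.filter(x => !x._1).values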
I am running a series of Spark functions with 9000 executors, and it's
resulting in 9000+ files, which is exceeding the namespace file count quota.
How can Spark be configured to use CombinedOutputFormat?
{code}
protected def writeOutputRecords(detailRecords:
RDD[(AvroKey[DetailOutputRecord],
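A hedged sketch of one way to cut the file count without a combined output
format: coalesce before writing (the record type and target count are
illustrative):

import org.apache.spark.rdd.RDD

def writeFewerFiles(records: RDD[String], path: String): Unit =
  records.coalesce(500).saveAsTextFile(path)  // ~500 output files instead of one per task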
Hello community,
I have built a jar file from my Spark app with Maven (mvn clean compile
assembly:single) and the following pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
How about other jobs? Is it an executor log, or a driver log? Could you
post other logs near this error, please? Thank you.
Best Regards,
Shixiong Zhu
2015-06-02 17:11 GMT+08:00 Anders Arpteg arp...@spotify.com:
Just compiled Spark 1.4.0-rc3 for Yarn 2.2 and tried running a job that
worked
Hello!
I have Spark running in standalone mode, and there are a bunch of worker
nodes connected to the master.
The workers have a shared file system, but the master node doesn't.
Is this something that's not going to work? i.e., should the master node
also be on the same shared filesystem mounted
You can run/submit your code from one of the workers which has access to the
file system, and it should be fine, I think. Give it a try.
Thanks
Best Regards
On Tue, Jun 2, 2015 at 3:22 PM, Pradyumna Achar pradyumna.ac...@gmail.com
wrote:
Hello!
I have Spark running in standalone mode, and there
The log is from the log aggregation tool (Hortonworks, yarn logs ...), so
both executors and driver. I'll send a private mail to you with the full
logs. Also, tried another job as you suggested, and it actually worked
fine. The first job was reading from a parquet source, and the second from
an
Can you run it using spark-submit? What is happening is that you are running a
simple Java program -- you've wrapped spark-core in your fat jar, but at
runtime you likely need the whole Spark system in order to run your
application. I would mark spark-core as provided (so you don't wrap it
in your
I would think the easiest way would be to create a view in the DB with column
names with no spaces.
In fact, you can pass a SQL query in place of a real table.
From the documentation: "The JDBC table that should be read. Note that anything
that is valid in a `FROM` clause of a SQL query can be used. For
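A hedged sketch of that trick with the Spark 1.3 API; the URL and query are
illustrative, and the subquery renames the offending column:

val df = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:sqlserver://host:1433;databaseName=mydb",
  "dbtable" -> "(SELECT [Column With Space] AS column_no_space FROM dbo.MyTable) AS t"))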
Okay, but how can I compile my app to run this without
-Dconfig.file=alt_reference1.conf?
2015-06-02 15:43 GMT+02:00 Yana Kadiyska yana.kadiy...@gmail.com:
This looks like your app is not finding your Typesafe config. The config
should usually be placed in a particular folder under your app to
This looks like your app is not finding your Typesafe config. The config
should usually be placed in a particular folder under your app to be seen
correctly. If it's in a non-standard location you can
pass -Dconfig.file=alt_reference1.conf to java to tell it where to look.
If this is a config that
Spark Shell:
val x =
sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
classOf[org.apache.hadoop.io.Text], classOf[org.apache.hadoop.io.Text])
OR
val x =
sc.sequenceFile("/sys/edw/dw_lstg_item/snapshot/2015/06/01/00/part-r-00761",
classOf[org.apache.hadoop.io.Text],
Like this... sqlContext should be a HiveContext instance:
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key, 0.5) from table").show()
On Tue, Jun 2, 2015 at 8:07 AM,
Basically, you need to convert it to a serializable format before doing the
collect/take.
You can fire up a spark shell and paste this:
val sFile = sc.sequenceFile[LongWritable, Text]("/home/akhld/sequence/sigmoid")
  .map(_._2.toString)
sFile.take(5).foreach(println)
Use the
Hi list,
With the help of Spark DataFrame API we can save a DataFrame into a
database table through insertIntoJDBC() call. However, I could not find any
info about how it handles the transactional guarantee. What if my program
gets killed during the processing? Would it end up in a partial load?
I've finally come to the same conclusion, but isn't there any way to call
these Hive UDAFs from agg(percentile(key, 0.5))?
On Tue, 2 Jun 2015 at 15:37, Yana Kadiyska yana.kadiy...@gmail.com
wrote:
Like this... sqlContext should be a HiveContext instance:
case class KeyValue(key: Int,
Which Maven dependency do I need, too?
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_vd_cdh5_maven_repo.html
On 02.06.2015 at 16:04, Yana Kadiyska wrote:
Can you run it using spark-submit? What is happening is that you are
running a simple Java program -- you've
Not super easily; the GroupedData class uses a strToExpr function which has
a pretty limited set of functions, so we can't pass in the name of an
arbitrary Hive UDAF (unless I'm missing something). We can instead
construct a column with the expression you want and then pass it in to
agg() that way.
I think this is causing issues upgrading ADAM
https://github.com/bigdatagenomics/adam to Spark 1.3.1 (cf. adam#690
https://github.com/bigdatagenomics/adam/pull/690#issuecomment-107769383);
attempting to build against Hadoop 1.0.4 yields errors like:
2015-06-02 15:57:44 ERROR Executor:96 -
Hi all,
In my streaming job I'm using the Kafka streaming direct approach and want to
maintain state with updateStateByKey. My PairRDD has the message's topic name +
partition id as a key. So, I assume that updateStateByKey could work within the
same partition as the KafkaRDD and not lead to shuffles. Actually this
Ah, interesting. While working on my new Tungsten shuffle manager, I came
up with some nice testing interfaces for allowing me to manually trigger
spills in order to deterministically test those code paths without
requiring large amounts of data to be shuffled. Maybe I could make similar
test
Is it possible with JavaSparkContext?
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaRDD<String> lines = jsc.textFile(args[0]);
If yes, is it the programmer's responsibility to first calculate split
locations and then instantiate the Spark context with preferred locations?
How is this achieved
My suggestion is that you change the Spark setting which controls the
compression codec that Spark uses for internal data transfers. Set
spark.io.compression.codec
to lzf in your SparkConf.
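A minimal sketch of that setting:

import org.apache.spark.SparkConf

val conf = new SparkConf().set("spark.io.compression.codec", "lzf")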
On Mon, Jun 1, 2015 at 8:46 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
Hello Josh,
Are you suggesting
I think Spark doesn't keep historical metrics. You can use something like
SPM for that -
http://blog.sematext.com/2014/01/30/announcement-apache-storm-monitoring-in-spm/
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr Elasticsearch Support *
The programs and schedules for Scala By the Bay (SBTB) and Big Data Scala
By the Bay (BDS) 2015 conferences are announced and published:
Scala By the Bay http://scala.bythebay.io — August 13-16 (
scala.bythebay.io)
Big Data Scala By the Bay http://bigdatascala.bythebay.io — August 16-18 (
Hi,
I would like to embed my own transformer in the Spark ML Pipeline but do
not see an example of it. Can someone share an example of which
classes/interfaces I need to extend/implement in order to do so? Thanks.
Dimple
It is not possible with JavaSparkContext either. The API mentioned below
currently does not have any effect (we should document this).
The primary difference between MR and Spark here is that MR runs each task
in its own YARN container, while Spark runs multiple tasks within an
executor, which
Cody,
Thanks, good point. I fixed getting the partition id to:
class MyPartitioner(offsetRanges: Array[OffsetRange]) extends Partitioner {
override def numPartitions: Int = offsetRanges.size
override def getPartition(key: Any): Int = {
// this is set in .map(m =
Hi
We are in the process of adding examples for feature transformations (
https://issues.apache.org/jira/browse/SPARK-7546) and this should be
available shortly on Spark Master.
In the meantime, the best place to start would be to look at how the
Tokenizer works here:
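Until those examples land, here is a hedged sketch modeled on Tokenizer
(Spark 1.4-era ml API; the class and its lowercasing logic are purely
illustrative, and the exact base-class contract varies between releases):

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.sql.types.{DataType, StringType}

class Lowercaser(override val uid: String)
    extends UnaryTransformer[String, String, Lowercaser] {
  // Transform one input cell; UnaryTransformer wires this into transform().
  override protected def createTransformFunc: String => String = _.toLowerCase
  override protected def outputDataType: DataType = StringType
}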
We use Scoobi + MR to perform joins, and we particularly use the blockJoin()
API of Scoobi:
/** Perform an equijoin with another distributed list where this list is
considerably smaller
* than the right (but too large to fit in memory), and where the keys of
right may be
* particularly skewed. */
Hi,
I'm trying to save a Spark SQL DataFrame to a persistent Hive table using
the default parquet data source.
I don't know how to change the replication factor of the generated
parquet files on HDFS.
I tried to set dfs.replication on HiveContext but that didn't work.
Any suggestions are
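For reference, a sketch of what I tried and of the alternative I'm
considering (the replication factor is illustrative):

hiveContext.setConf("dfs.replication", "2")         // what I tried; didn't take effect
sc.hadoopConfiguration.set("dfs.replication", "2")  // Hadoop-conf route I may try next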
Hi,
Small mock data doesn't reproduce the problem. IMHO the problem is reproduced
when we make the shuffle big enough to spill data to disk.
We will work on it to understand and reproduce the problem (not first
priority though...)
On 1 June 2015 at 23:02, Josh Rosen rosenvi...@gmail.com wrote:
How