Hello.
I have a number of static Arrays and Maps in my Spark Streaming driver
program.
They are simple collections, initialized with integer values and strings
directly in the code. There is no RDD/DStream involvement here.
I do not expect them to contain more than 100 entries each.
They are
Hi,
Yes, if they are not big, it's a good practice to broadcast them to avoid
serializing them with each closure.
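A minimal sketch of what I mean (assuming a SparkContext sc and an RDD[Int] named rdd; the values are made up):
val lookup = Map(1 -> "one", 2 -> "two") // small static collection in the driver
val bcLookup = sc.broadcast(lookup)      // shipped to each executor once
val described = rdd.map(x => bcLookup.value.getOrElse(x, "unknown")) // read via .value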
Paolo
Sent from my Windows Phone
From: frodo777 <roberto.vaquer...@bitmonlab.com>
Sent: 26/01/2015 14:34
To:
hi all,
I am trying to create a spark context programmatically, using
org.apache.spark.deploy.SparkSubmit. It all looks OK, except that the hadoop
config that is created during the process is not picking up core-site.xml,
so it defaults back to the local file-system. I have set HADOOP_CONF_DIR in
Hi,
is it possible to mix hosts with (significantly) different specs within a
cluster (without wasting the extra resources)? for example having 10 nodes with
36GB RAM/10CPUs now trying to add 3 hosts with 128GB/10CPUs - is there a way to
utilize the extra memory by spark executors (as my
Thanks. Turns out this is a proxy problem somehow. Sorry to bother you.
/Håkan
On Mon Jan 26 2015 at 11:02:18 AM Franc Carter franc.car...@rozettatech.com
wrote:
AMIs are specific to an AWS region, so the ami-id of the Spark AMI in
us-west will be different if it exists. I can't remember
Using Spark 1.2.0, we are facing some weird behaviour when performing self
join on a table with some ArrayType field.
(potential bug ?)
I have set up a minimal non working example here:
https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
Ah, I think for local mode you should give the full HDFS URL, like:
val logs = sc.textFile("hdfs://akhldz:9000/sigmoid/logs")
Thanks
Best Regards
On Mon, Jan 26, 2015 at 9:36 PM, Tamas Jambor jambo...@gmail.com wrote:
thanks for the reply. I have tried to add SPARK_CLASSPATH, I got a warning
that
It's more like Spark is not able to find the Hadoop jars. Try setting
HADOOP_CONF_DIR and also make sure the *-site.xml files are available on the
CLASSPATH/SPARK_CLASSPATH.
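If putting the XML files on the classpath is not an option, another sketch worth trying (hedged: the namenode address below is a placeholder; take the real value from your core-site.xml) is setting the Hadoop configuration on the SparkContext directly:
// Hypothetical value; copy fs.defaultFS from your actual core-site.xml
sc.hadoopConfiguration.set("fs.defaultFS", "hdfs://namenode:9000")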
Thanks
Best Regards
On Mon, Jan 26, 2015 at 7:28 PM, Staffan staffan.arvids...@gmail.com
wrote:
I'm using Maven and Eclipse to
You can create a partitioned Hive table using Spark SQL:
http://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
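A minimal sketch of that approach (the table and column names are made up for illustration):
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
// Partition columns become directories like y=2015/m=01/d=25 under the table path
hiveContext.sql(
  "CREATE TABLE IF NOT EXISTS events (payload STRING) PARTITIONED BY (y INT, m INT, d INT)")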
On Mon, Jan 26, 2015 at 5:40 AM, Danny Yates da...@codeaholics.org wrote:
Hi,
I've got a bunch of data stored in S3 under directories like this:
Thanks for the reply. I tried to add SPARK_CLASSPATH, but I got a warning
that it was deprecated (it didn't solve the problem). I also tried running with
--driver-class-path, which did not work either. I am trying this locally.
On Mon Jan 26 2015 at 15:04:03 Akhil Das ak...@sigmoidanalytics.com
Here is the first error I get at the executors:
15/01/26 17:27:04 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[handle-message-executor-16,5,main]
java.lang.StackOverflowError
at
You can also try adding core-site.xml to the SPARK_CLASSPATH. By the way,
are you running the application locally, or in standalone mode?
Thanks
Best Regards
On Mon, Jan 26, 2015 at 7:37 PM, jamborta jambo...@gmail.com wrote:
hi all,
I am trying to create a spark context programmatically,
I should have said I am running as yarn-client. All I can see is specifying the
generic executor memory that is then used in all containers.
On Monday, 26 January 2015, 16:48, Charles Feduke
charles.fed...@gmail.com wrote:
You should look at using Mesos. This should abstract
When you say remote cluster you need to make sure a few things like:
- No firewall/network is blocking any connection (Simply ping from
localmachine to remote ip and vice versa)
- Make sure all ports (unless you specify them manually) are open.
You can also refer this discussion,
Hi all,
I get the error below when I cache a partitioned Parquet table. It seems that
Spark is trying to extract the partition key from the Parquet file, so it
is not found. But other queries run successfully, even ones requesting the
partition key. Is it a bug in Spark SQL? Is there any workaround
You should look at using Mesos. This should abstract away the individual
hosts into a pool of resources and make the different physical
specifications manageable.
I haven't tried configuring Spark Standalone mode to have different specs
on different machines but based on spark-env.sh.template:
#
It seems likely that there is some sort of bug related to the reuse of
array objects that are returned by UDFs. Can you open a JIRA?
I'll also note that the sql method on HiveContext does run HiveQL
(configured by spark.sql.dialect) and the hql method has been deprecated
since 1.1 (and will
Hi,
I've got a bunch of data stored in S3 under directories like this:
s3n://blah/y=2015/m=01/d=25/lots-of-files.csv
In Hive, if I issue a query WHERE y=2015 AND m=01, I get the benefit that
it only scans the necessary directories for files to read.
As far as I can tell from searching and
You are creating a HiveContext, then using the sql method instead of hql.
Is that deliberate?
The code doesn't work if you replace HiveContext with SQLContext. Lots of
exceptions are thrown, but I don't have time to investigate now.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd
I'm using Maven and Eclipse to build my project. I'm letting Maven download
all the things I need for running everything, which has worked fine up until
now. I need to use the CDK library (https://github.com/egonw/cdk,
http://sourceforge.net/projects/cdk/) and once I add the dependencies to my
Currently no, if you don't want to use Spark SQL's HiveContext. But we're
working on adding partitioning support to the external data sources API,
with which you can create, for example, partitioned Parquet tables
without using Hive.
Cheng
On 1/26/15 8:47 AM, Danny Yates wrote:
Thanks
Hi Andreas,
There unfortunately is not a Python API yet for distributed matrices or
their operations. Here's the JIRA to follow to stay up-to-date on it:
https://issues.apache.org/jira/browse/SPARK-3956
There are internal wrappers (used to create the Python API), but they are
not really public
Good to hear there will be partitioning support. I've had some success loading
partitioned data specified with Unix globbing format, e.g.:
sc.textFile("s3://bucket/directory/dt=2014-11-{2[4-9],30}T00-00-00")
would load dates 2014-11-24 through 2014-11-30. Not the most ideal solution,
but it
Hi,
We are observing with certain regularity that our Spark jobs, running as Mesos
frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the Mesos cluster.
For example:
This is a job that has spark.cores.max = 4 and spark.executor.memory=3g
Hi Antony,
Unfortunately, all executors for any single Spark application must have the
same amount of memory. It's possible to configure YARN with different
amounts of memory for each host (using
yarn.nodemanager.resource.memory-mb), so other apps might be able to take
advantage of the extra
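To illustrate the per-application setting (a minimal sketch):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.executor.memory", "3g") // one value for every executor of this application
val sc = new SparkContext(conf)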
Where is the history server running? Is it running on the same node as the
logs directory?
Hi Xu-dong,
That's probably because your table's partition paths don't look like
hdfs://somepath/key=value/*.parquet. Spark is trying to extract the
partition key's value from the path while caching, and hence the exception
is thrown since it can't find one.
On Mon, Jan 26, 2015 at 10:45 AM,
Thanks Michael.
I'm not actually using Hive at the moment - in fact, I'm trying to avoid it
if I can. I'm just wondering whether Spark has anything similar I can
leverage?
Thanks
It looks like something weird is going on with your object serialization,
perhaps a funny form of self-reference which is not detected by
ObjectOutputStream's typical loop avoidance. That, or you have some data
structure like a linked list with a parent pointer and many thousands of
elements.
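For illustration, a hypothetical structure like this serializes with recursion depth proportional to its length, which is what overflows the stack:
// Each node drags in its parent; ObjectOutputStream walks the whole chain recursively
case class Node(value: Int, parent: Option[Node])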
Hi,
This seems to be a known issue (see here:
http://apache-spark-user-list.1001560.n3.nabble.com/ALS-failure-with-size-gt-Integer-MAX-VALUE-td19982.html)
The data set is about 1.5 TB. There are 14 region servers. I am not
sure how many regions there are for this data set. But very likely
(Looks like the list didn't like an HTML table in the previous email. My
apologies for any duplicates.)
Hi,
We are observing with certain regularity that our Spark jobs, running as Mesos
frameworks, are hoarding resources and not releasing them, resulting in
resource starvation for all jobs running on the
We are trying to create a Spark job that writes out a file to S3 that
leverages S3's server-side encryption for sensitive data. Typically this is
accomplished by setting the appropriate header on the put request, but it
isn't clear whether this capability is exposed in the Spark/Hadoop APIs.
Does
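A hedged sketch of one thing to try (it assumes the s3a connector and its server-side encryption property, which may not be present in older Hadoop versions; the bucket name is hypothetical):
// Assumption: fs.s3a.server-side-encryption-algorithm exists in your Hadoop build
sc.hadoopConfiguration.set("fs.s3a.server-side-encryption-algorithm", "AES256")
rdd.saveAsTextFile("s3a://some-bucket/encrypted-output")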
Does anyone know if I can save an RDD as a text file to a pre-created directory
in an S3 bucket?
I have a directory created in the S3 bucket: //nexgen-software/dev
When I tried to save an RDD as a text file in this directory:
rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
I got the following
Hello Sean and Akhil,
I shut down the services on Cloudera Manager. I shut them down in the
appropriate order and then stopped all services of CM. I then shut down my
instances. I then turned my instances back on, but I am getting the same
error.
1) I tried hadoop fs -safemode leave and it said
This might be helpful:
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
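The gist of the approach in that link, adapted as a sketch (untested here; assumes an RDD[(String, String)] named rdd):
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyAsFileNameOutput extends MultipleTextOutputFormat[Any, Any] {
  // Write only the value; the key is used solely to pick the file name
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  // One output file per distinct key
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
}

rdd.saveAsHadoopFile("/output/path", classOf[String], classOf[String],
  classOf[KeyAsFileNameOutput])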
On Tue Jan 27 2015 at 07:45:18 Sharon Rapoport sha...@plaid.com wrote:
Hi,
I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
I got them by
Hello everyone!
I try to execute select 2/3 and I get 0. Is there any way
to cast double to int or something similar?
Also it would be cool to get a list of functions supported by Spark SQL.
Thanks!
Have you tried floor() or ceil() functions ?
According to http://spark.apache.org/sql/, Spark SQL is compatible with
Hive SQL.
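For the original question, the usual fix is to make one operand a double so the division isn't done on integers (a minimal sketch, assuming a HiveContext named sqlContext; your setup already accepts FROM-less SELECT since select 2/3 ran):
sqlContext.sql("SELECT CAST(2 AS DOUBLE) / 3").collect() // 0.666...
// or simply: SELECT 2 / 3.0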
Cheers
On Mon, Jan 26, 2015 at 8:29 PM, 1esha alexey.romanc...@gmail.com wrote:
Hello everyone!
I try to execute select 2/3 and I get 0. Is there any
All,
I recently tried to build Spark 1.2 on my enterprise server (which has Hadoop
2.3 with YARN). Here are the steps I followed for the build:
$ mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
$ export SPARK_HOME=/path/to/spark/folder
$ export
Your output folder specifies
rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
So it will try to write to /dev/output, which is as expected. If you create
the directory /dev/output upfront in your bucket, and try to save it to
that (empty) directory, what is the behaviour?
On Tue, Jan 27,
Hi,
I have an rdd of [k,v] pairs. I want to save each [v] to a file named [k].
I got them by combining many [k,v] by [k]. I could then save to file by
partitions, but that still doesn't allow me to choose the name, and leaves
me stuck with foo/part-...
Any tips?
Thanks,
Sharon
Try using an absolute path to the pem file
On Jan 26, 2015, at 8:57 PM, ey-chih chow eyc...@hotmail.com wrote:
Hi,
I used the spark-ec2 script of spark 1.2 to launch a cluster. I have
modified the script according to
Command would be:
hadoop dfsadmin -safemode leave
If you are not able to ping your instances, it could be because you are
blocking all ICMP requests. I'm not quite sure why you are not able to
ping google.com from your instances. Make sure the internal IP (ifconfig)
is proper in the
When Spark saves an RDD to a text file, the directory must not exist upfront. It
will create the directory and write the data to part- files under it.
In my use case, I created a directory dev in the bucket ://nexgen-software/dev .
I expect it to create output directly under dev and a part-
Hi, can anyone show me some examples of using a UDAF with the Spark SQLContext?
I have tried select ceil(2/3), but got key not found: floor
On Tue, Jan 27, 2015 at 11:05 AM, Ted Yu yuzhih...@gmail.com wrote:
Have you tried floor() or ceil() functions ?
According to http://spark.apache.org/sql/, Spark SQL is compatible with
Hive SQL.
Cheers
On Mon, Jan 26, 2015 at
Hi,
I used the spark-ec2 script of spark 1.2 to launch a cluster. I have
modified the script according to
https://github.com/grzegorz-dubicki/spark/commit/5dd8458d2ab9753aae939b3bb33be953e2c13a70
But the script still hung at the following message:
Waiting for cluster to enter 'ssh-ready'
By default, the files will be created under the path provided as the
argument for saveAsTextFile. This argument is considered as a folder in the
bucket and actual files are created in it with the naming convention
part-n, where n is the number of the output partition.
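So with the path from this thread, the layout would look like this (a sketch):
rdd.saveAsTextFile("s3n://nexgen-software/dev/output")
// Expected result, one part file per partition:
//   s3n://nexgen-software/dev/output/part-00000
//   s3n://nexgen-software/dev/output/part-00001
//   ...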
On Mon, Jan 26, 2015 at
Awesome ! That would be great !!
On Mon, Jan 26, 2015 at 3:18 PM, Michael Armbrust mich...@databricks.com
wrote:
I'm aiming for 1.3.
On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Thanks Michael. I am sure there have been many requests for this support.
Any
Thanks. But after setting spark.shuffle.blockTransferService to nio, the
application fails with Akka client disassociation.
15/01/27 13:38:11 ERROR TaskSchedulerImpl: Lost executor 3 on
wynchcs218.wyn.cnw.co.nz: remote Akka client disassociated
15/01/27 13:38:11 INFO TaskSetManager: Re-queueing tasks
AMIs are specific to an AWS region, so the ami-id of the Spark AMI in
us-west will be different if it exists. I can't remember where but I have a
memory of seeing somewhere that the AMI was only in us-east
cheers
On Mon, Jan 26, 2015 at 8:47 PM, Håkan Jonsson haj...@gmail.com wrote:
Thanks,
Hi San,
You need to provide more information to diagnose this problem, such as:
1. What kind of SQL did you execute?
2. If there is a group operation in this SQL, could you gather some
statistics about how many unique group keys there are in this case?
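For point 2, something like this would do (a sketch; the SQLContext, table, and column names are hypothetical):
sqlContext.sql("SELECT COUNT(DISTINCT group_col) FROM my_table").collect()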
On 1/26/15 17:01, luohui20...@sina.com wrote:
I use this: http://scala-ide.org/
I also use Maven with this archetype:
https://github.com/davidB/scala-archetype-simple. To be frank though, you
should be fine using SBT.
On Sat, Jan 24, 2015 at 6:33 PM, riginos samarasrigi...@gmail.com wrote:
How to compile a Spark project in Scala IDE for
TinkerPop has become an Apache Incubator project and seems to have Spark
in mind in their proposal
https://wiki.apache.org/incubator/TinkerPopProposal.
That's good news!
I hope there will be nice collaborations between the communities.
On Wed, Jan 7, 2015 at 11:31 AM, Nicolas Colson
Thanks,
I also use Spark 1.2, prebuilt for Hadoop 2.4. I launch both 1.1 and
1.2 with the same command:
./spark-ec2 -k foo -i bar.pem launch mycluster
By default this launches in us-east-1. I tried changing the region
using:
-r us-west-1 but that had the same result:
Could not resolve
Hello,
we are running Spark 1.2.0 standalone on a cluster made up of 4 machines, each
of them running one Worker and one of them also running the Master; they are
all connected to the same HDFS instance.
Until a few days ago, they were all configured with
SPARK_WORKER_MEMORY = 18G
I also thought that the Hadoop mapper output is saved on HDFS, at least
if the job only has a Mapper but no Reducer.
If there is a Reducer, will the map output be saved on local disk?
From: Shao, Saisai
Date: 2015-01-26 15:23
To: Larry Liu
CC:
If there is no Reducer, there is no shuffle. The Mapper output goes to
HDFS, yes. But the question here is about shuffle files, right? Those
are written by the Mapper to local disk. Reducers load them from the
Mappers over the network then. Shuffle files do not go to HDFS.
On Mon, Jan 26, 2015 at
I definitely have Spark 1.2 running within EC2 using the spark-ec2 scripts.
I downloaded Spark 1.2, prebuilt for Hadoop 2.4 and later.
What parameters are you using when you execute spark-ec2?
I am launching in the us-west-1 region (ami-7a320f3f) which may explain
things.
On Mon Jan 26 2015
Hello,
we are running Spark 1.2.0 standalone on a cluster made up of 4 machines, each
of them running one Worker and one of them also running the Master; they are
all connected to the same HDFS instance.
Until a few days ago, they were all configured with
SPARK_WORKER_MEMORY = 18G
I am using SBT
On 26 Jan 2015 15:54, Luke Wilson-Mawer lukewilsonma...@gmail.com wrote:
I use this: http://scala-ide.org/
I also use Maven with this archetype:
https://github.com/davidB/scala-archetype-simple. To be frank though, you
should be fine using SBT.
On Sat, Jan 24, 2015 at 6:33
AFAIK ordering is not strictly guaranteed unless the RDD is the
product of a sort. I think that in practice, you'll never find
elements of a file read in some random order, for example (although
see the recent issue about partition ordering potentially depending on
how the local file system lists
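If you need a guaranteed order, the reliable way is an explicit sort (a minimal sketch):
val rdd = sc.parallelize(Seq(3, 1, 2))
val ordered = rdd.sortBy(identity) // ordering is guaranteed after a sort
ordered.collect() // Array(1, 2, 3)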
Hi,
What do your jobs do? Ideally post source code, but some description would
already be helpful in supporting you.
Memory leaks can have several causes; it may not be Spark at all.
Thank you.
Le 26 janv. 2015 22:28, Gerard Maas gerard.m...@gmail.com a écrit :
(looks like the list didn't like
I'm not actually using Hive at the moment - in fact, I'm trying to avoid
it if I can. I'm just wondering whether Spark has anything similar I can
leverage?
Let me clarify, you do not need to have Hive installed, and what I'm
suggesting is completely self-contained in Spark SQL. We support
Hi Jörn,
A memory leak on the job would be contained within the resources reserved
for it, wouldn't it?
And the job holding resources is not always the same. Sometimes it's one of
the streaming jobs, sometimes it's a heavy batch job that runs every hour.
It looks to me like whatever is causing the
I'm aiming for 1.3.
On Mon, Jan 26, 2015 at 3:05 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Thanks Michael. I am sure there have been many requests for this support.
Any release targeted for this?
Thanks,
On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com
Thanks Michael. I am sure there have been many requests for this support.
Any release targeted for this?
Thanks,
On Sat, Jan 24, 2015 at 11:47 AM, Michael Armbrust mich...@databricks.com
wrote:
Those annotations actually don't work because the timestamp in SQL has
optional nano-second
Ah, well that is interesting. I'll experiment further tomorrow. Thank you for
the info!
Hi,
I don't have any history server running. As SK already pointed out in a
previous post, the history server seems to be required only in Mesos or YARN
mode, not in standalone mode.
https://spark.apache.org/docs/1.1.1/monitoring.html
If Spark is run on Mesos or YARN, it is still possible to