Could it be that two jobs are trying to use the same file, and one gets to
it before the other and finally removes it?
David Newberger
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Wednesday, June 8, 2016 1:33 PM
To: user; user @spark
Subject: Creating a Hive table through
Hi Mich,
My gut says you are correct that each application should have its own
checkpoint directory. Though honestly I’m a bit fuzzy on checkpointing still as
I’ve not worked with it much yet.
Cheers,
David Newberger
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Friday, June
I was going to ask if you had two jobs running. If checkpointing for both is
set up to point at the same location, I could see an error like this happening.
Do both Spark jobs have a reference to a checkpoint directory?
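The idea can be sketched in plain Scala (illustrative names, not the posters' actual code): derive a distinct checkpoint directory from each application's name, so two jobs can never collide on the same location.

```scala
// Illustrative sketch: build a unique checkpoint directory per
// streaming application; in a real job you would pass the result to
// StreamingContext.checkpoint(...).
def checkpointDirFor(base: String, appName: String): String =
  s"$base/$appName-checkpoint"

// Two different applications get two different directories.
val dirA = checkpointDirFor("hdfs:///tmp/checkpoints", "jobA")
val dirB = checkpointDirFor("hdfs:///tmp/checkpoints", "jobB")
```

The base path and naming scheme here are assumptions; any convention works as long as no two applications share the resulting directory.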
David Newberger
From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent
Have you tried UseG1GC in place of UseConcMarkSweepGC? This article really
helped me with GC a few short weeks ago
https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
David Newberger
-Original Message-
From: Marco1982 [mailto:marco.plata
rk, it is cloned and can no
longer be modified by the user. Spark does not support modifying the
configuration at runtime.
“
David Newberger
From: Alonso Isidoro Roman [mailto:alons...@gmail.com]
Sent: Friday, June 3, 2016 10:37 AM
To: David Newberger
Cc: user@spark.apache.org
Subject: Re: About
What does your processing time look like? Is it consistently within that
20-second micro-batch window?
David Newberger
From: Adrian Tanase [mailto:atan...@adobe.com]
Sent: Friday, June 3, 2016 8:14 AM
To: user@spark.apache.org
Cc: Cosmin Ciobanu
Subject: [REPOST] Severe Spark Streaming performance
Alonso,
The CDH VM uses YARN and the default deploy mode is client. I’ve been able to
use the CDH VM for many learning scenarios.
http://www.cloudera.com/documentation/enterprise/latest.html
http://www.cloudera.com/documentation/enterprise/latest/topics/spark.html
David Newberger
From
Have you tried it without either of the setMaster lines?
Also, CDH 5.7 uses Spark 1.6.0 with some patches. I would recommend using the
Cloudera repo for the Spark dependencies in build.sbt. I'd also check the other
entries in build.sbt to see if there are CDH-specific versions.
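As a hedged build.sbt sketch (the repository URL is Cloudera's public artifact repo; the CDH-suffixed version string is an assumption to verify against Cloudera's documentation for your exact release):

```scala
// sbt settings, not standalone Scala: pull Spark from Cloudera's repo
// so the artifacts match the CDH-patched build (version string is an
// assumption; check it for your cluster).
resolvers += "cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.0-cdh5.7.0" % "provided"
```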
David Newberger
From
Is
https://github.com/alonsoir/awesome-recommendation-engine/blob/master/build.sbt
the build.sbt you are using?
David Newberger
QA Analyst
WAND - The Future of Restaurant Technology
(W) www.wandcorp.com<http://www.wandcorp.com/>
(E) david.newber...@wandcorp.com<mailto:dav
Hi All,
The error you are seeing looks really similar to SPARK-13514 to me. I could be
wrong, though.
https://issues.apache.org/jira/browse/SPARK-13514
Can you check yarn.nodemanager.local-dirs in your YARN configuration for
"file://"
Cheers!
David Newberger
-Original Message
Can we assume your question is “Will Spark replace Hadoop MapReduce?” or do you
literally mean replacing the whole of Hadoop?
David
From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop
Hi,
I hear that some
Hi Natu,
I believe you are correct: one RDD would be created for each file.
Cheers,
David
From: Natu Lauchande [mailto:nlaucha...@gmail.com]
Sent: Tuesday, April 12, 2016 1:48 PM
To: David Newberger
Cc: user@spark.apache.org
Subject: Re: DStream how many RDD's are created by batch
Hi
Hi,
Time is usually the criterion, if I'm understanding your question. An RDD is
created for each batch interval: if your interval is 500 ms, an RDD is created
every 500 ms; if it's 2 seconds, an RDD is created every 2 seconds.
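The arithmetic above can be sketched directly (a toy helper, not a Spark API):

```scala
// One RDD is produced per batch interval, so the number of RDDs over a
// span of time is simply span / interval (both in milliseconds here).
def rddsPerSpan(spanMs: Long, batchIntervalMs: Long): Long =
  spanMs / batchIntervalMs
```

With a 500 ms interval, one minute of streaming yields 120 RDDs; with a 2-second interval, 30.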
Cheers,
David
From: Natu Lauchande [mailto:nlaucha
Thanks much, Akhil. iptables is certainly a band-aid, but from an OpSec
perspective, it's troubling.
Is there any way to limit which interfaces the WebUI listens on? Is there a
Jetty configuration that I'm missing?
Thanks again for your help,
David
On Wed, Mar 30, 2016 at 2:25 AM,
> http://talebzadehmich.wordpress.com
> On 28 March 2016 at 15:32, David O'Gwynn wrote:
>
>> Greetings to all,
>>
>> I've search around the mailing list, but it would seem that (ne
Greetings to all,
I've searched around the mailing list, but it would seem that (nearly?)
everyone has the opposite problem to mine. I made a stab at looking in the
source for an answer, but I figured I might as well see if anyone else has
run into the same problem as I have.
I'm trying to limit my Mast
e-detector {
heartbeat-interval = 4 s
acceptable-heartbeat-pause = 16 s
}
}
}
.set("spark.akka.heartbeat.interval", "4s")
.set("spark.akka.heartbeat.pauses", "16s")
On Tue, Mar 15, 2016 at 9:50 PM, David Gomez Saavedra
wrote:
> hi th
ader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
16/03/15 20:48:12 WARN ReliableDeliverySupervisor: Association with
remote system [akka.tcp://spark-engine@spark-engine:9083] has failed,
address is now gated for [5000] ms. Reason: [Disassociated]
Any idea why the two actor systems get disassociated ?
Thank you very much in advance.
Best
David
The issue is related to this
https://issues.apache.org/jira/browse/SPARK-13906
.set("spark.rpc.netty.dispatcher.numThreads","2")
seems to fix the problem
On Tue, Mar 15, 2016 at 6:45 AM, David Gomez Saavedra
wrote:
> I have updated the config since I realized the act
If you are using sbt, I personally use sbt-pack to pack all dependencies
under a certain folder and then I set those jars in the spark config
// just for demo I load this through config file overridden by environment
variables
val sparkJars = Seq
("/ROOT_OF_YOUR_PROJECT/target/pack/lib/YOUR_JAR_DE
tcp6       0      0 :::6005           :::*    LISTEN
tcp6       0      0 172.18.0.2:6006   :::*    LISTEN
tcp6       0      0 172.18.0.2:       :::*    LISTEN
so far still no success
On Mon,
002 7003 7004 7005 7006
I'm using those Docker images to run Spark jobs without a problem. I
only get errors on the streaming app.
Any pointers on what could be wrong?
Thank you very much in advance.
David
fka/libs/metrics-core-2.2.0.jar,'
'/usr/share/java/mysql.jar')
got the logging to acknowledge adding the jars to the HTTP server (just as in
the spark-submit output above), but whether I leave the other config options in
place or remove them, the class is still not found.
Is this not possible in Python?
Incidentally, I have tried SPARK_CLASSPATH (getting the message that it's
deprecated and ignored anyway) and I cannot find anything else to try.
Can anybody help?
David K.
the vision is to get rid of all cluster
> management when using Spark.
You might find one of the hosted Spark platform solutions such as
Databricks or Amazon EMR that handle cluster management for you a good
place to start. At least in my experience, they got me
the artifacts for your package to Maven central.
David
On Mon, Feb 1, 2016 at 7:03 AM, Praveen Devarao wrote:
> Hi,
>
> Is there any guidelines or specs to write a Spark package? I would
> like to implement a spark package and would like to know the way it needs to
> be
>
ROSE Spark Package: https://github.com/onetapbeyond/opencpu-spark-executor
<https://github.com/onetapbeyond/opencpu-spark-executor>
Questions, suggestions, feedback welcome.
David
--
"*All that is gold does not glitter,** Not all those who wander are lost."*
/apache/spark/commit/2388de51912efccaceeb663ac56fc500a79d2ceb
This should resolve the issue I'm experiencing. I'll get hold of a build
from source and try it out.
Thanks for all your help!
David
On Wed, Jan 27, 2016 at 12:51 AM Ram Sriharsha
wrote:
> btw, OneVsRest is using the
g JIRAs and
getting patches tomorrow morning. It's late here!
Thanks for the swift response,
David
On Tue, Jan 26, 2016 at 11:09 PM Ram Sriharsha
wrote:
> Hi David
>
> If I am reading the email right, there are two problems here right?
> a) for rare classes the random spli
with using the label metadata as a shortcut.
Do you agree that there is an issue here? Would you accept contributions
to the code to remedy it? I'd gladly take a look if I can be of help.
Many thanks,
David
On Tue, Jan 26, 2016 at 1:29 PM David Brooks wrote:
> Hi Ram,
>
> I did
and issue. I'm happy to try a simpler method for
providing column metadata, if one is available.
Thanks,
David
On Mon, Jan 25, 2016 at 11:13 PM Ram Sriharsha
wrote:
> Hi David
>
> What happens if you provide the class labels via metadata instead of
> letting OneVsRest de
ry rare classes.
I'm happy to look into patching the code, but I first wanted to confirm
that the problem was real, and that I wasn't somehow misunderstanding how I
should be using OneVsRest.
Any guidance would be appreciated - I'm new to the list.
Many thanks,
David
The foreach operation on RDD has a void (Unit) return type. See attached. So
there is no return value to the driver.
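The same behavior shows up on plain Scala collections, which makes for an easy check (a minimal sketch, not the attachment referenced above):

```scala
// foreach returns Unit: it exists for side effects only, so nothing
// comes back to the caller (or, in the RDD case, to the driver).
val nums = Seq(1, 2, 3)
val nothingBack: Unit = nums.foreach(_ => ())

// To get a value back, use an operation that returns one, e.g. reduce.
val total = nums.reduce(_ + _)
```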
David
"All that is gold does not glitter, Not all those who wander are lost."
Original Message
Subject: rdd.foreach return value
Local Time: Janua
Yep that's exactly what we want. Thanks for all the info Cody.
Dave.
On 13 Jan 2016 18:29, "Cody Koeninger" wrote:
> The idea here is that the custom partitioner shouldn't actually get used
> for repartitioning the kafka stream (because that would involve a shuffle,
> which is what you're trying
PIs in Java, JavaScript
and .NET that can easily support your use case. The outputs of your DeployR
integration could then become inputs to your data processing system.
David
"All that is gold does not glitter, Not all those who wander are lost."
Original Message
Subject: R
weight as ROSE, and it is not designed to work in a clustered environment.
ROSE, on the other hand, is designed for scale.
David
"All that is gold does not glitter, Not all those who wander are lost."
Original Message
Subject: Re: ROSE: Spark + R on the JVM.
Local Time:
Hi Corey,
> Would you mind providing a link to the github?
Sure, here is the github link you're looking for:
https://github.com/onetapbeyond/opencpu-spark-executor
David
"All that is gold does not glitter, Not all those who wander are lost."
Original Message --
ou to [take a
look](https://github.com/onetapbeyond/opencpu-spark-executor). Any feedback,
questions etc very welcome.
David
"All that is gold does not glitter, Not all those who wander are lost."
on 2.7. Some libraries that Spark depend on
>>> stopped supporting 2.6. We can still convince the library maintainers to
>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>> Python 2.6 to run Spark.
>>>
>>> Thanks.
>>>
>>>
>>>
>>
--
David Chin, Ph.D.
david.c...@drexel.edu | Sr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/
https://linuxfollies.blogspot.com/
+1.215.221.4747 (mobile)
https://github.com/prehensilecode
used this approach yet,
and if so, what has your experience been with using it? If it helps, we'd be
looking to implement it using Scala. Secondly, in general, what has people's
experience been with using experimental features in Spark?
Cheers,
David Newberger
015 at 5:33 PM, David John wrote:
I have used Spark 1.4 for 6 months. Thanks to all the members of this
community for your great work. I have a question about the logging issue. I
hope this question can be solved.
The program is running under this configurations: YARN Cluster, YARN-client
mode.
In Scala, writing code like: rdd.
s the Maven manifest goes, I'm really not sure. I will research it
though. Now I'm wondering if my mergeStrategy is to blame? I'm going to
try there next.
Thank you for the help!
On Tue, Dec 22, 2015 at 1:18 AM, Igor Berman wrote:
> David, can you verify that mysql connect
; MergeStrategy.discard
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList("org", "apache", xs @ _*) => MergeStrategy.first
case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
case
ll that does
"sqlContext.load("jdbc", myOptions)". I know this is a total newbie
question but in my defense, I'm fairly new to Scala, and this is my first
go at deploying a fat jar with sbt-assembly.
Thanks for any advice!
--
David Yerrington
yerrington.net
Hi Eran,
Based on the limited information, the first things that come to my mind are
processor, RAM, and disk speed.
David Newberger
QA Analyst
WAND - The Future of Restaurant Technology
(W) www.wandcorp.com<http://www.wandcorp.com/>
(E) david.newber...@wandcorp.com<mailto:dav
Hello Spark experts,
We are currently evaluating Spark on our cluster that already supports MRv2
over YARN.
We have noticed a problem with running jobs concurrently, in particular
that a running Spark job will not release its resources until the job is
finished. Ideally, if two people run any co
I ran into this recently. Turned out we had an old
org-xerial-snappy.properties file in one of our conf directories that
had the setting:
# Disables loading Snappy-Java native library bundled in the
# snappy-java-*.jar file forcing to load the Snappy-Java native
# library from the java.library
A graph is vertices and edges. What else are you expecting to save/load? You
could save/load the triplets, but reconstructing the graph from triplets is
actually more work than saving the vertices and edges separately.
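A plain-Scala sketch of why (illustrative types, not the GraphX API): triplets repeat each vertex attribute once per incident edge, so rebuilding the vertex set means deduplicating, whereas separately saved vertex and edge lists load back directly.

```scala
// Hypothetical triplet record: each edge carries both endpoints' data.
case class Triplet(srcId: Long, srcAttr: String, dstId: Long, dstAttr: String)

val triplets = Seq(Triplet(1L, "a", 2L, "b"), Triplet(1L, "a", 3L, "c"))

// Rebuilding vertices from triplets requires deduplication...
val vertices = triplets
  .flatMap(t => Seq((t.srcId, t.srcAttr), (t.dstId, t.dstAttr)))
  .distinct
// ...whereas the edge list falls out directly.
val edges = triplets.map(t => (t.srcId, t.dstId))
```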
Dave
From: Gaurav Kumar [mailto:gauravkuma...@gmail.com]
Sent: Friday, November 13, 2015
I have verified that this error exists on my system as well, and the suggested
workaround also works.
Spark version: 1.5.1; 1.5.2
Mesos version: 0.21.1
CDH version: 4.7
I have set up the spark-env.sh to contain HADOOP_CONF_DIR pointing to the
correct place, and I have also linked in the hdfs-si
work between daily increased large tables,
>> for
>>
>> both spark sql and cassandra. I can see that the [1] use case facilitates
>> FiloDB to achieve columnar storage and query performance, but we had
>> nothing more
>>
>> knowledge.
>>
>>
I have a Spark Streaming job that runs great the first time around (Elastic
MapReduce 4.1.0), but when recovering from a checkpoint in S3, the job runs
but Spark itself seems to be jacked-up in lots of little ways:
- Executors, which are normally stable for days, are terminated within a
coup
Got it working! Thank you for confirming my suspicion that this issue was
related to Java. When I dug deeper I found multiple versions and some other
issues. I worked on it a while before deciding it would be easier to just
uninstall all Java and reinstall clean JDK, and now it works perfectly.
as java8u60
I double checked my python version and it appears to be 2.7.10
I am familiar with command line, and have background in hadoop, but this has
me stumped.
Thanks in advance,
David Bess
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Insta
your code to make is use less memory.
David
On Tue, Oct 6, 2015 at 3:19 PM, unk1102 wrote:
> Hi I have a Spark job which runs for around 4 hours and it shared
> SparkContext and runs many child jobs. When I see each job in UI I see
> shuffle spill of around 30 to 40 GB and because of
I am using Spark Streaming to receive data from Kafka and then write the result
RDD to an external database inside foreachPartition(). Everything works fine; my
question is how we can ensure no data loss if there is a database connection
failure, or another exception happens, while writing data to external storag
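One common pattern (a hedged sketch with illustrative names, not the asker's code) is to wrap the write in a retry helper so transient connection failures don't drop a partition's data, and to rethrow on final failure so the micro-batch itself fails and can be replayed from the source (e.g. from Kafka offsets):

```scala
import scala.util.control.NonFatal

// Retry a side-effecting write up to maxAttempts times; rethrow on
// final failure so the caller (e.g. a streaming batch) fails loudly
// instead of silently losing data.
def withRetry[T](maxAttempts: Int)(op: () => T): T = {
  var attempt = 0
  while (true) {
    try return op()
    catch {
      case NonFatal(e) =>
        attempt += 1
        if (attempt >= maxAttempts) throw e
    }
  }
  sys.error("unreachable")
}
```

Inside foreachPartition() you would call the actual database write through this helper; note the write should also be idempotent, since a replayed batch may write the same records twice.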
Storm writes the data to both cassandra and kafka, spark reads the
>>> actual data from kafka , processes the data and writes to cassandra.
>>> The second approach avoids additional hit of reading from cassandra
>>> every minute , a device has written data to cassandra at the
o be a bug introduced in 1.3. Hopefully
it's fixed in 1.4.
Thanks,
Charles
On 9/9/15, 7:30 AM, "David Rosenstrauch" wrote:
Standalone.
On 09/08/2015 11:18 PM, Jeff Zhang wrote:
What cluster mode do you use ? Standalone/Yarn/Mesos ?
On Wed, Sep 9, 2015 at 11:15 AM, David Rosens
Standalone.
On 09/08/2015 11:18 PM, Jeff Zhang wrote:
What cluster mode do you use ? Standalone/Yarn/Mesos ?
On Wed, Sep 9, 2015 at 11:15 AM, David Rosenstrauch
wrote:
Our Spark cluster is configured to write application history event logging
to a directory on HDFS. This all works fine
Our Spark cluster is configured to write application history event
logging to a directory on HDFS. This all works fine. (I've tested it
with Spark shell.)
However, on a large, long-running job that we ran tonight, one of our
machines at the cloud provider had issues and had to be terminated
Hi Ajay,
Are you trying to save to your local file system or to HDFS?
// This would save to HDFS under "/user/hadoop/counter"
counter.saveAsTextFile("/user/hadoop/counter");
David
On Sun, Aug 30, 2015 at 11:21 AM, Ajay Chander wrote:
> Hi Everyone,
>
> Recent
the code below is taken from the spark website and generates the error
detailed
Hi using spark 1.3 and trying some sample code:
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
behavior with the take function, or at least without needing
to choose an element randomly. I was able to get the behavior I wanted
above by just changing the seed until I got the dataframe I wanted, but I
don't think that is a good approach in general.
Any insight is appreciated.
Best,
David Mon
ments from anyone who may be doing something similar.
Cheers,
Dave
--
David Chin, Ph.D.
david.c...@drexel.edu | Sr. Systems Administrator, URCF, Drexel U.
http://www.drexel.edu/research/urcf/
https://linuxfollies.blogspot.com/
215.221.4747 (mobile)
https://github.com/prehensilecode
Hi, all,
I am just setting up to run Spark in standalone mode, as a (Univa) Grid
Engine job. I have been able to set up the appropriate environment
variables such that the master launches correctly, etc. In my setup, I
generate GE job-specific conf and log dirs.
However, I am finding that the SPA
This is likely due to data skew: if you are using key-value pairs, one key has
many more records than the other keys. Do you have any groupBy operations?
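A toy illustration of skew (plain Scala with hypothetical keys): after a groupBy, one group ends up with almost all the records.

```scala
// One "hot" key carries 1000 records while the others carry 1 each;
// on a cluster, the task handling "hot" dominates the stage's runtime.
val records = Seq.fill(1000)(("hot", 1)) ++ Seq(("cold", 1), ("cool", 1))
val sizes = records.groupBy(_._1).map { case (k, vs) => (k, vs.size) }
```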
David
On Tue, Jul 14, 2015 at 9:43 AM, shahid wrote:
> hi
>
> I have a 10 node cluster i loaded the data onto hdfs, so t
It seems this feature was added in Hive 0.13.
https://issues.apache.org/jira/browse/HIVE-4943
I would assume this is supported as Spark is by default compiled using Hive
0.13.1.
On Sun, Jul 12, 2015 at 7:42 PM, Ruslan Dautkhanov
wrote:
> You can see what Spark SQL functions are supported in Spa
As Sean suggested you can actually build Spark 1.4 for CDH 5.4.x and also
include Hive libraries for 0.13.1, but *this will be completely unsupported
by Cloudera*.
I would suggest to do that only if you just want to experiment with new
features from Spark 1.4. I.e. Run SparkSQL with sort-merge join
You can certainly query over 4 TB of data with Spark. However, you will
get an answer in minutes or hours, not in milliseconds or seconds. OLTP
databases are used for web applications, and typically return responses in
milliseconds. Analytic databases tend to operate on large data sets, and
retu
Hi all,
Do you know if there is an option to specify how many replicas we want
when caching a table in memory in the SparkSQL Thrift server? I have not seen
any option so far, but I assumed there is one, since the Storage section of the
UI shows that there is 1 x replica of your
Dataframe/Ta
Hi chaps,
It seems there is an issue while saving dataframes in Spark 1.4.
The default file extension inside Hive warehouse folder is now
part-r-X.gz.parquet but while running queries from SparkSQL Thriftserver is
still looking for part-r-X.parquet.
Is there any config parameter we can use as wor
I am having the same problem reading JSON. There does not seem to be a way
of selecting a field that has a space, such as "Executor Info" from the Spark
logs. I suggest that we open a JIRA ticket to address this issue.
On Jun 2, 2015 10:08 AM, "ayan guha" wrote:
> I would think the easiest way would b
reatly appreciate pointers to some specific documentation or
examples if you have seen something like this before.
Thanks,
David
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-boundaries-and-triggering-processing-using-tags-in-the-data-tp23060.html
s used to be inherent to the
> “commercial” vendors, but I can confirm as fact it is also in effect to the
> “open source movement” (because human nature remains the same)
>
>
>
> *From:* David Morales [mailto:dmora...@stratio.com]
> *Sent:* Thursday, May 14, 2015 4:30 PM
>
ery similar… I will contact you to
> understand if we can contribute to you with some piece !
>
> Best
>
> Paolo
>
> *From:* Evo Eftimov
> *Sent:* Thursday, 14 May 2015 17:21
> *To:* 'David Morales' , Matei Zaharia
>
> *Cc:* user@spark.apac
>> Regards.
> >>
> >>
> >>
> >> --
> >> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/SPARKTA-a-real-time-aggregation-engine-based-on-Spark-Streaming-tp22883.html
> >> Sent from the Apache Spark User List mailing list archive at N
Hi!
I've been using spark for the last months and it is awesome. I'm pretty new on
this topic so don't be too harsh on me.
Recently I've been doing some simple tests with Spark Streaming for log
processing and I'm considering different ETL input solutions such as Flume or
PDI+Kafka.
My use case
how to make the magic happen with
sparkR. Anyone got any ideas?
thanks!
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com<mailto:broo...@annaisystems.com>
www.
Does anyone know in which version of Spark will there be support for
ORCFiles via spark.sql.hive? Will it be in 1.4?
David
w00t - t/y for this! I'm currently doing a deep dive into the RDD memory
footprint under various conditions so this is timely and helpful.
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com<mailto:broo...@annaisystems.com>
will do! I've got to clear with my boss what I can post and in what manner, but
I'll definitely do what I can to put some working code out into the world so
the next person who runs into this brick wall can benefit from all this :-D
DAVID HOLIDAY
Software Engineer
760 607 3300 | Offi
w0t! that did it! t/y so much!
I'm going to put together a pastebin or something that has all the code put
together so if anyone else runs into this issue they will have some working
code to help them figure out what's going on.
DAVID HOLIDAY
Software Engineer
76
se - there are 10,000 rows of data in the table I
pointed to. however, when I try to grab the first element of data thusly:
rddX.first
I get the following error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 in
stage 0.0 (TID 0) had a not serializable result:
org
see the entire thread of code, responses from
notebook, etc. I'm going to try invoking the same techniques both from within a
stand-alone scala problem and from the shell itself to see if I can get some
traction. I'll report back when I have more data.
cheers (and thx!)
DAVID HOLIDAY
Sof
hi Irfan,
thanks for getting back to me - i'll try the accumulo list to be sure. what is
the normal use case for spark though? I'm surprised that hooking it into
something as common and popular as accumulo isn't more of an every-day task.
DAVID HOLIDAY
Software Engineer
760 60
hat I haven't specified any
parameters as to which table to connect with, what the auths are, etc.
so my question is: what do I need to do from here to get those first ten rows
of table data into my RDD?
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobi
kk - I'll put something together and get back to you with more :-)
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com<mailto:broo...@annaisystems.com>
www.AnnaiSyste
hi all - thx for the alacritous replies! so regarding how to get things from
notebook to spark and back, am I correct that spark-submit is the way to go?
DAVID HOLIDAY
Software Engineer
760 607 3300 | Office
312 758 8385 | Mobile
dav...@annaisystems.com<mailto:broo...@annaisystems.
Thank you for your help. "toDF()" solved my first problem. And, the
second issue was a non-issue, since the second example worked without any
modification.
David
On Sun, Mar 15, 2015 at 1:37 AM, Rishi Yadav wrote:
> programmatically specifying Schema need
pache.spark.rdd.RDD[String],
org.apache.spark.sql.types.StructType)
val df = sqlContext.createDataFrame(people, schema)
Any help would be appreciated.
David
-Environment
will have you quickly up and running on a single machine without having to
manage the details of the system installations. There is a Docker version,
https://github.com/ibm-et/spark-kernel/wiki/Using-the-Docker-Container-for-the-Spark-Kernel
, if you prefer Docker.
Regards,
David
King
;s good to know, I'll certainly give it a look.
Can you give me a hint about how you unzip your input files on the fly? I
thought it wasn't possible to parallelize zipped inputs unless they
were unzipped before being passed to Spark?
Joe
On 3 February 2015 at 17:48, David Rosenstrauch wro
e for practical? See "Why you
cannot use S3 as a replacement for HDFS"[0]. I'd love to be proved wrong,
though, that would make things a lot easier.
[0] http://wiki.apache.org/hadoop/AmazonS3
On 3 February 2015 at 16:45, David Rosenstrauch wrote:
You could also just push the dat
You could also just push the data to Amazon S3, which would decouple the
size of the cluster needed to process the data from the size of the data.
DR
On 02/03/2015 11:43 AM, Joe Wass wrote:
I want to process about 800 GB of data on an Amazon EC2 cluster. So, I need
to store the input in HDFS so
4, 2015 at 3:53 PM, David Jones
wrote:
> Should I be able to pass multiple paths separated by commas? I haven't
> tried but didn't think it'd work. I'd expected a function that accepted a
> list of strings.
>
> On Wed, Jan 14, 2015 at 3:20 PM, Yana Kadiyska
; Here was my question for reference:
>
> http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3ccaaswr-5rfmu-y-7htluj2eqqaecwjs8jh+irrzhm7g1ex7v...@mail.gmail.com%3E
>
> On Wed, Jan 14, 2015 at 4:34 AM, David Jones
> wrote:
>
>> Hi,
>>
>> I have a p
EMR.
If that's not possible, is there some way to load multiple avro files into
the same table/RDD so the whole dataset can be processed (and in that case
I'd supply paths to each file concretely, but I *really* don't want to have
to do that).
Thanks
David
would be if the AMP Lab or Databricks
maintained a set of benchmarks on the web that showed how much each successive
version of Spark improved.
Dave
From: Madabhattula Rajesh Kumar [mailto:mrajaf...@gmail.com]
Sent: Monday, January 12, 2015 9:24 PM
To: Buttler, David
Subject: Re: GraphX vs