Hi all,
I wrote an AccumulatorParam, but in my job it does not seem to be adding the
values. When I tried with an Int accumulator in my job, the values were added
correctly.
object MapAccumulatorParam extends AccumulatorParam[Map[Long, Int]] {
  def zero(initialValue: Map[Long, Int] = Map.empty):
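For reference, here is a minimal sketch (not necessarily the original poster's code) of how an AccumulatorParam over Map[Long, Int] is usually completed: zero supplies the empty map and addInPlace merges two partial maps, summing counts for keys that appear in both.

import org.apache.spark.AccumulatorParam

object MapAccumulatorParam extends AccumulatorParam[Map[Long, Int]] {
  // Empty map as the neutral element.
  def zero(initialValue: Map[Long, Int]): Map[Long, Int] = Map.empty

  // Merge two partial maps, summing the counts of shared keys.
  def addInPlace(m1: Map[Long, Int], m2: Map[Long, Int]): Map[Long, Int] =
    m2.foldLeft(m1) { case (acc, (k, v)) => acc + (k -> (acc.getOrElse(k, 0) + v)) }
}

// Hypothetical usage on the driver:
// val acc = sc.accumulator(Map.empty[Long, Int])(MapAccumulatorParam)
// rdd.foreach(x => acc += Map(someId -> 1))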
Never mind, I just figured out my problem:
I was running *deploy.ExternalShuffleService* instead of
*deploy.mesos.MesosExternalShuffleService*.
- Jo Voordeckers
On Fri, Apr 15, 2016 at 2:29 PM, Jo Voordeckers
wrote:
> Forgot to mention we're running Spark (Streaming)
Thanks!
I got this to work.
val csvRdd = sc.parallelize(data.split("\n"))
val df = new com.databricks.spark.csv.CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)
  .csvRdd(sqlContext, csvRdd)
> On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon wrote:
>
> Hi,
>
> Would you try
Forgot to mention we're running Spark (Streaming) 1.5.1
- Jo Voordeckers
On Fri, Apr 15, 2016 at 12:21 PM, Jo Voordeckers
wrote:
> Hi all,
>
> I've got mesos in coarse grained mode with dyn alloc, shuffle service
> enabled and am running the shuffle service on every
Is this right?
import com.databricks.spark.csv
val csvRdd = data.flatMap(x => x.split("\n"))
val df = new CsvParser().csvRdd(sqlContext, csvRdd, useHeader = true)
Thanks,
Ben
> On Apr 15, 2016, at 1:14 PM, Hyukjin Kwon wrote:
>
> Hi,
>
> Would you try the code
Hi,
Would you try the code below?
val csvRDD = ... your processing to build the CSV RDD ...
val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true)
Thanks!
On 16 Apr 2016 1:35 a.m., "Benjamin Kim" wrote:
> Hi Hyukjin,
>
> I saw that. I don’t know how to use it. I’m
Looks plausible. Glad it helped.
Personally I prefer ORC tables, as they are arguably a better fit for
columnar storage. Others may differ on this :)
Cheers
Dr Mich Talebzadeh
LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Hi,
Following your answer I was able to make it work.
FYI:
Basically the solution is to manually create the table in Hive using a SQL
"CREATE TABLE" command.
When doing a saveAsTable, the Hive metastore doesn't get the DataFrame's
schema information.
So now my flow is:
* Create a DataFrame
* if it is the
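A hedged sketch of that flow (database, table, and column names are made up for illustration): create the table explicitly via SQL, then write the DataFrame into it rather than relying on saveAsTable to register the schema in the metastore.

sqlContext.sql(
  """CREATE TABLE IF NOT EXISTS my_db.my_table (
    |  id BIGINT,
    |  name STRING
    |) STORED AS PARQUET""".stripMargin)

// Append the DataFrame's rows into the pre-created Hive table.
df.write.mode("append").insertInto("my_db.my_table")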
Hi,
is it a physical server or AWS/Azure? What are the execution parameters for
the spark-shell command? Which Hadoop distro/version and Spark version?
Kind Regards,
Jan
> On 15 Apr 2016, at 16:15, luca_guerra wrote:
>
> Hi,
> I'm looking for a solution to improve my Spark cluster
Hi all,
I've got Mesos in coarse-grained mode with dynamic allocation and the shuffle
service enabled, and am running the shuffle service on every Mesos slave.
I'm assuming I misconfigured something on the scheduler service; any ideas?
On my driver I see a few of these; I guess it's one for every executor:
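For context, a hedged sketch of the driver-side configuration being assumed here (the master URL is a placeholder). Note that, per the follow-up earlier in this digest, the service running on each slave must be the Mesos-aware deploy.mesos.MesosExternalShuffleService rather than the plain deploy.ExternalShuffleService.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("mesos://zk://...:2181/mesos")        // placeholder master URL
  .set("spark.mesos.coarse", "true")               // coarse-grained mode
  .set("spark.dynamicAllocation.enabled", "true")  // dynamic allocation
  .set("spark.shuffle.service.enabled", "true")    // use the external shuffle service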
Hi,
I found a problem with using "spark.driver.userClassPathFirst" and
SPARK_CLASSPATH in spark-env.sh in a Standalone environment, and want to confirm
that it in fact has no good solution.
We are running Spark 1.5.2 in standalone mode on a cluster. Since the cluster
doesn't have the direct
>
> If we expect fields nested in structs to always be much slower than flat
> fields, then I would be keen to address that in our ETL pipeline with a
> flattening step. If it's a known issue that we expect will be fixed in
> upcoming releases, I'll hold off.
>
The difference might be even larger
Mich,
I am curious as well about how Spark casts between different types of filters.
For example, the conversion happens implicitly for the 'EqualTo' filter:
scala> sqlContext.sql("SELECT * from events WHERE `registration` =
> '2015-05-28'").explain()
>
> 16/04/15 11:44:15 INFO ParseDriver: Parsing
See this thread: http://search-hadoop.com/m/q3RTtsFrd61q291j1
On Fri, Apr 15, 2016 at 5:38 AM, Carlos Rojas Matas
wrote:
> Hi guys,
>
> any clue on this? Clearly the
> spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
> executors.
>
> Thanks,
>
Hi Steve,
I wrote a blog post on the configuration of Spark that we've used,
including the log4j.properties:
http://progexc.blogspot.co.il/2014/12/spark-configuration-mess-solved.html
(What we did was to distribute the relevant *log4j.properties* file to all
of the slaves, at the same location.)
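A hedged sketch of the same idea expressed in code; the path is an assumption and must point at a log4j.properties that exists at that location on every slave, as described above.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
    "-Dlog4j.configuration=file:/opt/spark/conf/log4j.properties")  // hypothetical path on each slave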
Thanks Hyukjin for the suggestion. I will take a look at implementing Solr
datasource with CatalystScan.
Holden,
If I were to use DataSets, then I would essentially do this:
val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
for (message <- messages.asScala) {
val files =
Is that 32 CPUs or 32 cores?
So in this configuration, assuming 32 cores, you have 1 worker with how much
memory (deducting memory for the OS etc.) and 32 cores.
What is the ratio of memory per core in this case?
HTH
Dr Mich Talebzadeh
LinkedIn *
I wonder if anyone can confirm whether Spark on YARN is the problem here, or
whether it's how AWS has put it together? I'm wondering if Spark on YARN has
problems with configuration files for the workers and driver?
Peter Halliday
On Thu, Apr 14, 2016 at 1:09 PM, Peter Halliday wrote:
Hi Hyukjin,
I saw that. I don’t know how to use it. I’m still learning Scala on my own. Can
you help me get started?
Thanks,
Ben
> On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon wrote:
>
> I hope it was not too late :).
>
> It is possible.
>
> Please check csvRdd api here,
>
Hi,
I'm looking for a solution to improve my Spark cluster's performance. I have
read at http://spark.apache.org/docs/latest/hardware-provisioning.html:
"We recommend having 4-8 disks per node". I have tried with both one and two
disks, but I have seen that with 2 disks the execution time is
This reminds me of this Jira,
https://issues.apache.org/jira/browse/SPARK-3376 and this PR,
https://github.com/apache/spark/pull/5403.
AFAIK, it is not and won't be supported.
On 2 Apr 2016 4:13 a.m., "slavitch" wrote:
> Hello;
>
> I’m working on spark with very large memory
Hi All,
I have some avro data, which I am reading in the following way.
Query:
val data = sc.newAPIHadoopFile(file,
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable]).map(_._1.datum)
But, when I try to print the data, it is generating
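One hedged precaution worth noting with this pattern: Hadoop input formats can reuse the same AvroKey object for every record, so copying each datum (here simply via toString) before collecting or printing avoids seeing one record repeated. A minimal sketch, mirroring the names above:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val rows = sc.newAPIHadoopFile(file,
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable])
  .map { case (k, _) => k.datum.toString }  // materialize a copy of each record

rows.take(10).foreach(println)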
You can call the stop() method.
> On Apr 15, 2016, at 5:21 AM, ram kumar wrote:
>
> Hi,
> I started hivecontext as,
>
> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
>
> I want to stop this sql context
>
> Thanks
Hi guys,
any clue on this? Clearly the
spark.executor.extraJavaOpts=-Dlog4j.configuration is not working on the
executors.
Thanks,
-carlos.
On Wed, Apr 13, 2016 at 2:48 PM, Carlos Rojas Matas
wrote:
> Hi Yong,
>
> thanks for your response. As I said in my first email,
Hi,
I started hivecontext as,
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc);
I want to stop this sql context
Thanks
Hi All,
I am trying to access Hive from Spark but am getting an exception:
The root scratch dir: /tmp/hive on HDFS should be writable. Current
permissions are: rw-rw-rw-
Code:
String logFile = "hdfs://hdp23ha/logs"; // Should be some file on
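The error message itself points at the fix: /tmp/hive on HDFS needs broader permissions (Hive typically wants at least 733). A hedged sketch using the Hadoop API, equivalent to running hdfs dfs -chmod 733 /tmp/hive as a suitably privileged user:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.FsPermission

val fs = FileSystem.get(sc.hadoopConfiguration)
// 733 (octal) = owner rwx, group wx, others wx, which satisfies Hive's check.
fs.setPermission(new Path("/tmp/hive"), new FsPermission(Integer.parseInt("733", 8).toShort))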
Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any option other than "explain(true)" to debug Spark
DataFrame code?
I am facing a strange issue.
I have a lookup DataFrame and am using it to join another DataFrame on
different columns.
I am getting an *Analysis exception* in the third join.
When
Thanks Takeshi,
I did check it. I believe you are referring to this statement:
"This is likely because we cast this expression weirdly to be compatible
with Hive. Specifically I think this turns into, CAST(c_date AS STRING) >=
"2016-01-01", and we don't push down casts down into data sources."
Hello,
I'm trying to make a call on whether my team should invest time adding a
step to "flatten" our schema as part of our ETL pipeline, to improve the
performance of interactive queries.
Our data start out life as Avro before being converted to Parquet, and so
we follow the Avro idioms of creating
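For reference, a hedged sketch of what such a flattening step could look like (a generic one-level flatten, not our actual pipeline): each field of a top-level struct column is promoted to a flat column before the data is written as Parquet.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StructType

def flatten(df: DataFrame): DataFrame = {
  // Promote the fields of every top-level struct column to flat columns
  // named parent_child; leave all other columns untouched.
  val cols = df.schema.fields.flatMap { f =>
    f.dataType match {
      case st: StructType =>
        st.fieldNames.map(n => df(s"${f.name}.$n").as(s"${f.name}_$n"))
      case _ => Seq(df(f.name))
    }
  }
  df.select(cols: _*)
}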
You could use a different format, and use the Dataset or DataFrame API instead of an RDD.
> On 14 Apr 2016, at 23:21, Bibudh Lahiri wrote:
>
> Hi,
> As part of a larger program, I am extracting the distinct values of some
> columns of an RDD with 100 million records and 4
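A hedged sketch of that suggestion (the path and column name are made up): read the data as a DataFrame, for example from Parquet, and take the distinct values of a column without dropping down to an RDD.

val df = sqlContext.read.parquet("/data/events")    // hypothetical input path
val distinctVals = df.select("user_id").distinct()  // hypothetical column
distinctVals.show()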