Re: Spark Sql group by less performant

2018-12-10 Thread Georg Heiler
See
https://databricks.com/blog/2016/05/19/approximate-algorithms-in-apache-spark-hyperloglog-and-quantiles.html
You most probably do not require exact counts.
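
For concreteness, a minimal sketch of the approximate counting that post
describes, assuming Spark 2.x and reusing the sparkSession/temp_view/column
names from the quoted question (the 5% relative error is an arbitrary
placeholder):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch only: approx_count_distinct is backed by HyperLogLog++ and trades a
// bounded relative error (here 5%) for a far cheaper aggregation than an
// exact distinct count.
Dataset<Row> approxCounts = sparkSession.sql(
    "SELECT column1, approx_count_distinct(column8, 0.05) AS approxDistinctColumn8 "
  + "FROM temp_view "
  + "GROUP BY column1");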

On Tue, Dec 11, 2018 at 02:09, 15313776907 <15313776...@163.com> wrote:

> I think you can add executor memory.
>
> 15313776907
> Email: 15313776...@163.com
>
> On 12/11/2018 08:28, lsn24  wrote:
> Hello,
>
> I have a requirement where I need to get the total count of rows and the
> total count of failed rows (failedRows) based on a grouping.
>
> The code looks like this:
>
> myDataset.createOrReplaceTempView("temp_view");
>
> Dataset<Row> countDataset = sparkSession.sql("Select
> column1,column2,column3,column4,column5,column6,column7,column8, count(*)
> as
> totalRows, sum(CASE WHEN (column8 is NULL) THEN 1 ELSE 0 END) as
> failedRows
> from temp_view group by
> column1,column2,column3,column4,column5,column6,column7,column8");
>
>
> Up to around 50 million records the query performance was OK. After that it
> gave up, mostly resulting in an out-of-memory exception.
>
> I read documentation and blogs; most of them give examples of
> RDD.reduceByKey. But here I have a Dataset and Spark SQL.
>
> What am I missing here?
>
> Any help will be appreciated.
>
> Thanks!
>
>
>
>
>
>


Re: Spark Sql group by less performant

2018-12-10 Thread 15313776907
I think you can add executor memory.
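
For reference, a rough sketch of where such settings would go. The keys are
real Spark configuration properties, but the values are placeholders that
have to be tuned to the cluster, and executor memory is normally fixed at
submit time (e.g. via --executor-memory) rather than in code:

import org.apache.spark.sql.SparkSession;

// Sketch only, not taken from the original thread.
SparkSession spark = SparkSession.builder()
    .appName("group-by-aggregation")
    .config("spark.executor.memory", "8g")          // more heap per executor
    .config("spark.sql.shuffle.partitions", "400")  // spread the GROUP BY shuffle over more, smaller partitions
    .getOrCreate();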


15313776907
Email: 15313776...@163.com

On 12/11/2018 08:28, lsn24 wrote:
Hello,

I have a requirement where I need to get the total count of rows and the
total count of failed rows (failedRows) based on a grouping.

The code looks like this:

myDataset.createOrReplaceTempView("temp_view");

Dataset<Row> countDataset = sparkSession.sql("Select
column1,column2,column3,column4,column5,column6,column7,column8, count(*) as
totalRows, sum(CASE WHEN (column8 is NULL) THEN 1 ELSE 0 END) as failedRows
from temp_view group by
column1,column2,column3,column4,column5,column6,column7,column8");


Up to around 50 million records the query performance was OK. After that it
gave up, mostly resulting in an out-of-memory exception.

I read documentation and blogs; most of them give examples of
RDD.reduceByKey. But here I have a Dataset and Spark SQL.

What am I missing here?

Any help will be appreciated.

Thanks!
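
As an aside on the Dataset-versus-reduceByKey question above: the same
aggregation can also be written with the untyped Dataset API instead of a SQL
string. The two produce equivalent plans, so this alone will not change the
memory behaviour; a sketch, reusing myDataset and the column names from the
quoted code:

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.sum;
import static org.apache.spark.sql.functions.when;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Sketch only: count(*) and the conditional sum expressed as Column functions.
Dataset<Row> countDataset = myDataset
    .groupBy("column1", "column2", "column3", "column4",
             "column5", "column6", "column7", "column8")
    .agg(count(lit(1)).alias("totalRows"),
         sum(when(col("column8").isNull(), 1).otherwise(0)).alias("failedRows"));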








Re: spark sql - group by constant column

2015-07-15 Thread Lior Chaga
I found out the problem. Grouping by a constant column value is indeed
impossible.
The reason it "worked" in my project is that I gave the constant column an
alias that already exists in the schema of the dataframe. The dataframe
contained a "data_timestamp" column representing an hour, and I added to the
select a constant "data_timestamp" that represented the timestamp of the day.
That was the cause of my original bug: I thought I was grouping by the day
timestamp when I was actually grouping by each hour, and therefore I got
multiple rows for each group-by combination.
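
For illustration, a sketch of the workaround this implies, under the
assumption that Spark SQL accepts a literal in the select list of an
aggregate without listing it among the grouping keys; the constant is kept
in the projection but dropped from GROUP BY, so no alias has to resolve
(the hiveContext variable is illustrative):

import org.apache.spark.sql.DataFrame;

// Sketch only: the constant 10 stays in the projection, but is not a
// grouping key.
DataFrame result = hiveContext.sql(
    "SELECT primary_one, primary_two, 10 AS num, SUM(measure) AS total_measures "
  + "FROM tbl "
  + "GROUP BY primary_one, primary_two");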

On Wed, Jul 15, 2015 at 10:09 AM, Lior Chaga wrote:

> Hi,
>
> I'm facing a bug with GROUP BY in Spark SQL (version 1.4).
> I registered a JavaRDD of objects containing integer fields as a table.
>
> Then I'm trying to do a group by, with a constant value in the group by
> fields:
>
> SELECT primary_one, primary_two, 10 as num, SUM(measure) as total_measures
> FROM tbl
> GROUP BY primary_one, primary_two, num
>
>
> I get the following exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'num' given input
> columns measure, primary_one, primary_two
>
> I tried both HiveContext and SQLContext.
> The odd thing is that this kind of query actually works for me in a project
> I'm working on, but there I have another bug (the group by does not yield
> the expected results).
>
> The only reason I can think of is that maybe in my real project, the
> context configuration is different.
> In my above example the configuration of the HiveContext is empty.
>
> In my real project, the configuration is shown below.
> Any ideas?
>
> Thanks,
> Lior
>
> Hive context configuration in project:
> "(mapreduce.jobtracker.jobhistory.task.numberprogresssplits,12)"
> "(nfs3.mountd.port,4242)"
> "(mapreduce.tasktracker.healthchecker.script.timeout,60)"
> "(yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms,1000)"
> "(mapreduce.input.fileinputformat.input.dir.recursive,false)"
> "(hive.orc.compute.splits.num.threads,10)"
>
> "(mapreduce.job.classloader.system.classes,java.,javax.,org.apache.commons.logging.,org.apache.log4j.,org.apache.hadoop.)"
> "(hive.auto.convert.sortmerge.join.to.mapjoin,false)"
> "(hadoop.http.authentication.kerberos.principal,HTTP/_HOST@LOCALHOST)"
> "(hive.exec.perf.logger,org.apache.hadoop.hive.ql.log.PerfLogger)"
>  "(hive.mapjoin.lazy.hashtable,true)"
>  "(mapreduce.framework.name,local)"
>  "(hive.exec.script.maxerrsize,10)"
>  "(dfs.namenode.checkpoint.txns,100)"
>  "(tfile.fs.output.buffer.size,262144)"
>  "(yarn.app.mapreduce.am.job.task.listener.thread-count,30)"
>  "(mapreduce.tasktracker.local.dir.minspacekill,0)"
>  "(hive.support.concurrency,false)"
>  "(fs.s3.block.size,67108864)"
>
>  "(hive.script.recordwriter,org.apache.hadoop.hive.ql.exec.TextRecordWriter)"
>  "(hive.stats.retries.max,0)"
>  "(hadoop.hdfs.configuration.version,1)"
>  "(dfs.bytes-per-checksum,512)"
>  "(fs.s3.buffer.dir,${hadoop.tmp.dir}/s3)"
>  "(mapreduce.job.acl-view-job, )"
>  "(hive.typecheck.on.insert,true)"
>  "(mapreduce.jobhistory.loadedjobs.cache.size,5)"
>  "(mapreduce.jobtracker.persist.jobstatus.hours,1)"
>  "(hive.unlock.numretries,10)"
>  "(dfs.namenode.handler.count,10)"
>  "(mapreduce.input.fileinputformat.split.minsize,1)"
>  "(hive.plan.serialization.format,kryo)"
>  "(dfs.datanode.failed.volumes.tolerated,0)"
>  "(yarn.resourcemanager.container.liveness-monitor.interval-ms,60)"
>  "(yarn.resourcemanager.amliveliness-monitor.interval-ms,1000)"
>  "(yarn.resourcemanager.client.thread-count,50)"
>  "(io.seqfile.compress.blocksize,100)"
>  "(mapreduce.tasktracker.http.threads,40)"
>  "(hive.explain.dependency.append.tasktype,false)"
>  "(dfs.namenode.retrycache.expirytime.millis,60)"
>  "(dfs.namenode.backup.address,0.0.0.0:50100)"
>  "(hive.hwi.listen.host,0.0.0.0)"
>  "(dfs.datanode.data.dir,file://${hadoop.tmp.dir}/dfs/data)"
>  "(dfs.replication,3)"
>  "(mapreduce.jobtracker.jobhistory.block.size,3145728)"
>
>  
> "(dfs.secondary.namenode.kerberos.internal.spnego.principal,${dfs.web.authentication.kerberos.principal})"
>  "(mapreduce.task.profile.maps,0-2)"
>  "(fs.har.impl,org.apache.hadoop.hive.shims.HiveHarFileSystem)"
>  "(hive.stats.reliable,false)"
>  "(yarn.nodemanager.admin-env,MALLOC_ARENA_MAX=$MALLOC_ARENA_MAX)"
>
>


Re: Spark SQL group by

2015-02-06 Thread Mohnish Kodnani
Doh :) Thanks.. seems like brain freeze.


On Fri, Feb 6, 2015 at 3:22 PM, Michael Armbrust wrote:

> You can't use columns (timestamp) that aren't in the GROUP BY clause.
> Spark 1.2+ gives you a better error message for this case.
>
> On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani wrote:
>
>> Hi,
>> I am trying to issue a SQL query against a Parquet file and am getting
>> errors; I would like some help figuring out what is going on.
>>
>> The SQL:
>> select timestamp, count(rid), qi.clientname from records where timestamp
>> > 0 group by qi.clientname
>>
>> I am getting the following error:
>> *org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
>> attribute, tree: timestamp#0L*
>> at
>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
>> at
>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
>> at
>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
>> at
>> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
>> at
>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
>> at
>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at
>> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>> at scala.collection.immutable.List.foreach(List.scala:318)
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>> at
>> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:52)
>> at
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.<init>(Aggregate.scala:176)
>> at
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
>> at
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>> *Caused by: java.lang.RuntimeException: Couldn't find timestamp#0L in
>> [aggResult:SUM(PartialCount#14L)#17L,clientName#11]*
>> at scala.sys.package$.error(package.scala:27)
>> at
>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:46)
>> at
>> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:43)
>> at
>> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>>
>>
>


Re: Spark SQL group by

2015-02-06 Thread Michael Armbrust
You can't use columns (timestamp) that aren't in the GROUP BY clause.
Spark 1.2+ gives you a better error message for this case.
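
A hedged sketch of the corrected query this implies: every non-aggregated
column in the select list also appears in GROUP BY. Whether grouping by
timestamp is really what was intended is a separate question, and the
sqlContext variable plus the DataFrame return type (Spark 1.3+) are
assumptions, since the quoted stack trace points at an older
SchemaRDD-based release:

import org.apache.spark.sql.DataFrame;

// Sketch only: timestamp is added to the grouping keys so it can legally
// appear in the select list.
DataFrame fixed = sqlContext.sql(
    "SELECT timestamp, qi.clientname, count(rid) AS ridCount "
  + "FROM records "
  + "WHERE timestamp > 0 "
  + "GROUP BY timestamp, qi.clientname");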

On Fri, Feb 6, 2015 at 3:12 PM, Mohnish Kodnani wrote:

> Hi,
> I am trying to issue a SQL query against a Parquet file and am getting
> errors; I would like some help figuring out what is going on.
>
> The SQL:
> select timestamp, count(rid), qi.clientname from records where timestamp >
> 0 group by qi.clientname
>
> I am getting the following error:
> *org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding
> attribute, tree: timestamp#0L*
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> at
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43)
> at
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:156)
> at
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:42)
> at
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
> at
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection$$anonfun$$init$$2.apply(Projection.scala:52)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(Projection.scala:52)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7$$anon$1.<init>(Aggregate.scala:176)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:172)
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.sql.SchemaRDD.compute(SchemaRDD.scala:115)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> *Caused by: java.lang.RuntimeException: Couldn't find timestamp#0L in
> [aggResult:SUM(PartialCount#14L)#17L,clientName#11]*
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:46)
> at
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:43)
> at
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:46)
>
>