[Spark SQL][reopen SPARK-16951]:Alternative implementation of NOT IN to Anti-join

2020-05-11 Thread Shuang, Linna1
Hello,

This JIRA (SPARK-16951) already being closed with the resolution of "Won't Fix" 
on 23/Feb/17.

But in TPC-H test, we met performance issue of Q16, which used NOT IN subquery 
and being translated into broadcast nested loop join. This query uses almost 
half time of total 22 queries. For example, 512GB data set, totally execution 
time is 1400 seconds, while Q16's execution time is 630 seconds.

TPC-H is a common spark sql performance benchmark, this performance issue will 
be met usually. Is it possible to reopen this JIRA and fix this issue?

Thanks,
Linna



[PySpark] Tagging descriptions

2020-05-11 Thread Rishi Shah
Hi All,

I have a tagging problem at hand where we currently use regular expressions
to tag records. Is there a recommended way to distribute & tag? Data is
about 10TB large.

-- 
Regards,

Rishi Shah


XPATH_INT behavior - XML - Function in Spark

2020-05-11 Thread Chetan Khatri
Hi Spark Users,

I want to parse xml coming in the query columns and get the value, I am
using *xpath_int* which works as per my requirement but When I am embedding
in the Spark SQL query columns it is failing.

select timesheet_profile_id,
*xpath_int(timesheet_profile_code, '(/timesheetprofile/weeks/week[*
*td.current_week**]/**td.day**)[1]')*

*this failed *
where Hardcoded values work for the above scenario

scala> spark.sql("select timesheet_profile_id,
xpath_int(timesheet_profile_code,
'(/timesheetprofile/weeks/week[2]/friday)[1]') from
TIMESHEET_PROFILE_ATT").show(false)

Anyone has worked on this? Thanks in advance.

Thanks
- Chetan


GrupState limits

2020-05-11 Thread tleilaxu
Hi,
I am tracking states in my Spark streaming application with
MapGroupsWithStateFunction described here:
https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html
Which are the limiting factors on the number of states a job can track at
the same time? Is it memory? Could be a bounded data structure in the
internal implementation? Anything else ...
You might have valuable input on this while I am trying to setup and test
this.

Thanks,
Arnold


unsubscribe

2020-05-11 Thread Nikita Goyal



Re: AnalysisException - Infer schema for the Parquet path

2020-05-11 Thread Chetan Khatri
Thanks Mich, Nilesh.
What is also working is create schema object and provide at .schema(X) in
spark.read. statement.

Thanks a lot.

On Sun, May 10, 2020 at 2:37 AM Nilesh Kuchekar 
wrote:

> Hi Chetan,
>
>   You can have a static parquet file created, and when you
> create a data frame you can pass the location of both the files, with
> option mergeSchema true. This will always fetch you a dataframe even if the
> original file is not present.
>
> Kuchekar, Nilesh
>
>
> On Sat, May 9, 2020 at 10:46 PM Mich Talebzadeh 
> wrote:
>
>> Have you tried catching error when you are creating a dataframe?
>>
>> import scala.util.{Try, Success, Failure}
>> val df = Try(spark.read.
>>  format("com.databricks.spark.xml").
>>option("rootTag", "hierarchy").
>>option("rowTag", "sms_request").
>>load("/tmp/broadcast.xml")) match {
>>case Success(df) => df
>>case Failure(exception) => throw new Exception("foo")
>>   }
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Sat, 9 May 2020 at 22:51, Chetan Khatri 
>> wrote:
>>
>>> Hi Spark Users,
>>>
>>> I've a spark job where I am reading the parquet path, and that parquet
>>> path data is generated by other systems, some of the parquet paths doesn't
>>> contains any data which is possible. is there a any way to read the parquet
>>> if no data found I can create a dummy dataframe and go ahead.
>>>
>>> One way is to check path exists like
>>>
>>>  val conf = spark.sparkContext.hadoopConfiguration
>>> val fs = org.apache.hadoop.fs.FileSystem.get(conf)
>>> val currentAreaExists = fs.exists(new
>>> org.apache.hadoop.fs.Path(consumableCurrentArea))
>>>
>>> But I don't want to check this for 300 parquets, just if data doesn't
>>> exist in the parquet path go with the dummy parquet / custom DataFrame
>>>
>>> AnalysisException: u'Unable to infer schema for Parquet. It must be
>>> specified manually.;'
>>>
>>> Thanks
>>>
>>


Regarding anomaly detection in real time streaming data

2020-05-11 Thread Hemant Garg
Hello sir,
I'm currently working on a project where i would've to detect anomalies in
real time streaming data pushing data from kafka into apache spark. I chose
to go with streaming kmeans clustering algorithm, but I couldn't find much
about it. Do you think it is a suitable algorithm to go with or should i
think of something else.


unsubscribe

2020-05-11 Thread Nikita Goyal
Please unsubscribe me.

Thanks,


Spark wrote to Hive table. file content format and fileformat in metadata doesn't match

2020-05-11 Thread 马阳阳
Hi,
We are currently trying to replace hive with Spark thrift server.
We encounter a problem. With the following sql:
create table test_db.sink_test as select [some columns] from 
test_db.test_source
After the SQL run successfully, we queried data from test_db.test_sink. The 
data is
gibberish. After some inspection, we found that test_db.test_sink has orc file 
(which can
be read with spark.read.orc) on hdfs, but the metadata for it is text. Using 
spark.read.orc().show,
the output column names are not column names from test_db.test_source, but 
something like:
|_col0|   _col1|   _col2|   _col3| _col4|
_col5|_col6|_col7|_col8|_col9|_col10|_col11|_col12|

What is mysterious is that after rerunning the SQL, without any changes, the 
table will be
alright (the file content and file format in metadata matches).

I wonder if anyone has encountered the same problem.

Appreciate for any response.

Re: [Spark SQL][Beginner] Spark throw Catalyst error while writing the dataframe in ORC format

2020-05-11 Thread Deepak Garg
Hi Jeff,

I increased the broadcast timeout. Now facing the new error.

Caused by: java.lang.InterruptedException
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
at
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at
org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:280)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:103)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
at
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.consume(BroadcastHashJoinExec.scala:39)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenOuter(BroadcastHashJoinExec.scala:345)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:103)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at
org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:35)
at
org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:65)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.consume(BroadcastHashJoinExec.scala:39)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:250)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:102)
at
org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:181)
at
org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:354)
at
org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:383)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:354)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doProduce(BroadcastHashJoinExec.scala:97)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:83)
at
org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.produce(BroadcastHashJoinExec.scala:39)
at
org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:45)
at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:88)
at