GroupState limits

2020-05-11 Thread tleilaxu
Hi, I am tracking states in my Spark streaming application with MapGroupsWithStateFunction, described here: https://spark.apache.org/docs/2.4.0/api/java/org/apache/spark/sql/streaming/GroupState.html What are the limiting factors on the number of states a job can track at the same time? Is it
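Spark keeps one state object per group key for mapGroupsWithState, so in practice the limit is executor memory and state-store size rather than a fixed count of keys. A toy, pure-Python stand-in for that per-key state store (a sketch of the concept only, not Spark's implementation; all names here are invented):

```python
# Toy simulation of per-group state as maintained by mapGroupsWithState:
# each distinct key holds one state value, so memory grows with the
# number of tracked groups -- the main practical limiting factor.

def update_counts(state, events):
    """Apply a batch of (key, value) events to the keyed state store."""
    for key, value in events:
        # Analogue of GroupState.getOption/update: fetch-or-create, then write back.
        state[key] = state.get(key, 0) + value
    return state

store = {}
update_counts(store, [("a", 1), ("b", 2), ("a", 3)])
```

In real Spark the store is checkpointed and partitioned across executors, and state can be expired with timeouts to bound its growth.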

[PySpark] Tagging descriptions

2020-05-11 Thread Rishi Shah
Hi All, I have a tagging problem at hand where we currently use regular expressions to tag records. Is there a recommended way to distribute & tag? Data is about 10TB large. -- Regards, Rishi Shah
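One common approach is to express the tagging rules as a plain Python function with patterns compiled once, then distribute it over the 10 TB with a PySpark UDF or mapPartitions. A minimal sketch of the per-record logic (the tag names and regexes below are invented for illustration):

```python
import re

# Hypothetical tag rules: tag name -> compiled pattern (compile once, reuse per record).
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}-\d{4}\b"),
}

def tag_record(text):
    """Return the names of all rules whose pattern matches the record."""
    return [name for name, rx in RULES.items() if rx.search(text)]

tags = tag_record("call 555-1234 or mail bob@example.com")
```

Wrapped in `pyspark.sql.functions.udf` (or a pandas UDF for better throughput), each partition then tags its records in parallel without shipping state around.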

XPATH_INT behavior - XML - Function in Spark

2020-05-11 Thread Chetan Khatri
Hi Spark Users, I want to parse XML arriving in query columns and extract a value. I am using *xpath_int*, which works as per my requirement, but when I embed it in the Spark SQL query columns it fails. select timesheet_profile_id, *xpath_int(timesheet_profile_code,
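For reference, Spark SQL's xpath_int evaluates an XPath expression against an XML string and returns an integer. Its behaviour can be roughly approximated with the standard library (a sketch only, with simplified path semantics; the XML and path below are invented):

```python
import xml.etree.ElementTree as ET

def xpath_int(xml_string, path):
    """Rough stand-in for Spark SQL's xpath_int: first match, as int."""
    node = ET.fromstring(xml_string).find(path)
    return int(node.text) if node is not None and node.text else 0

value = xpath_int("<profile><code>42</code></profile>", "code")
```

One common cause of failures when embedding the function in a larger query is that Spark's xpath functions expect the path argument to be a constant string literal, not an expression built per row; worth checking if that applies here.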

[Spark SQL][reopen SPARK-16951]:Alternative implementation of NOT IN to Anti-join

2020-05-11 Thread Shuang, Linna1
Hello, This JIRA (SPARK-16951) was closed with the resolution "Won't Fix" on 23/Feb/17. But in a TPC-H test we hit a performance issue with Q16, which uses a NOT IN subquery that gets translated into a broadcast nested loop join. This query takes almost half the total time of the 22 queries. For

Re: [Spark SQL][Beginner] Spark throw Catalyst error while writing the dataframe in ORC format

2020-05-11 Thread Deepak Garg
Hi Jeff, I increased the broadcast timeout. Now I am facing a new error. Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1039) at
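The original timeout suggests a broadcast join whose build side is too slow or too large. Two settings commonly adjusted in this situation, shown here as a hedged example (the values are illustrative, not recommendations for this job):

```
# spark-defaults.conf (or pass with --conf on spark-submit)
spark.sql.broadcastTimeout=1200          # seconds to wait for a broadcast to complete
spark.sql.autoBroadcastJoinThreshold=-1  # -1 disables automatic broadcast joins entirely
```

Disabling the automatic broadcast forces a shuffle-based join, which is slower for genuinely small tables but avoids the broadcast build blocking altogether.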

Spark wrote to Hive table: file content format and file format in metadata don't match

2020-05-11 Thread 马阳阳
Hi, We are currently trying to replace Hive with the Spark thrift server. We encountered a problem with the following SQL: create table test_db.sink_test as select [some columns] from test_db.test_source After the SQL ran successfully, we queried data from test_db.test_sink. The data is
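One way to rule out a mismatch between the files Spark writes and the format recorded in the metastore is to pin the format explicitly in the CTAS instead of relying on defaults (Hive's `hive.default.fileformat` and Spark's `spark.sql.sources.default` can disagree). A hedged sketch, keeping the thread's table names:

```sql
-- Pin the file format so the data files and the metastore entry agree.
CREATE TABLE test_db.sink_test
STORED AS ORC   -- or PARQUET, matching what downstream readers expect
AS SELECT /* some columns */ * FROM test_db.test_source;
```

If the mismatch persists, comparing `DESCRIBE FORMATTED test_db.sink_test` (its InputFormat/SerDe lines) against the actual files on disk usually pinpoints which side is wrong.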

Re: AnalysisException - Infer schema for the Parquet path

2020-05-11 Thread Chetan Khatri
Thanks Mich, Nilesh. What also works is creating a schema object and providing it via .schema(X) in the spark.read statement. Thanks a lot. On Sun, May 10, 2020 at 2:37 AM Nilesh Kuchekar wrote: > Hi Chetan, > > You can have a static parquet file created, and when you > create a data

Regarding anomaly detection in real time streaming data

2020-05-11 Thread Hemant Garg
Hello sir, I'm currently working on a project where I have to detect anomalies in real-time streaming data, pushing data from Kafka into Apache Spark. I chose to go with the streaming k-means clustering algorithm, but I couldn't find much about it. Do you think it is a suitable algorithm to go with
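With k-means-based anomaly detection (whether via Spark MLlib's StreamingKMeans or otherwise), the usual scoring step is to flag points whose distance to the nearest learned centroid exceeds a threshold. That step in plain Python (the centroids and threshold are invented for illustration; in a real pipeline the centroids come from the fitted model):

```python
import math

def nearest_distance(point, centroids):
    """Euclidean distance from a point to its closest cluster centre."""
    return min(math.dist(point, c) for c in centroids)

def is_anomaly(point, centroids, threshold):
    # A point far from every learned cluster centre is flagged as anomalous.
    return nearest_distance(point, centroids) > threshold

centroids = [(0.0, 0.0), (10.0, 10.0)]
normal = is_anomaly((0.5, 0.2), centroids, threshold=3.0)
outlier = is_anomaly((5.0, 5.0), centroids, threshold=3.0)
```

The threshold is the delicate part: it is often set from the distribution of distances on known-normal data (for example, a high percentile) rather than chosen by hand.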