log transferring into hadoop/spark
Since Flume is no longer actively developed, what is the current open-source tool for transferring web server logs into HDFS/Spark? Thank you.

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty
Thanks Sean! That was a simple fix. I changed it to "Create or Replace Table" but now I am getting the following error. I am still researching solutions but so far no luck. ParseException: mismatched input '' expecting {'ADD', 'AFTER', 'ALL', 'ALTER', 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF', 'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK', 'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', 'RECOVER', 'REDUCE', 
'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW', 'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

== SQL ==
CREATE OR REPLACE TABLE

On Mon, Aug 1, 2022 at 8:32 PM Sean Owen wrote:
> Pretty much what it says? You are creating a table over a path that already has data in it. You can't do that without mode=overwrite at least, if that's what you intend.
> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga wrote:
>> - Component: Spark Delta, Spark SQL
>> - Level: Beginner
>> - Scenario: Debug, How-to
>>
>> *Python in Jupyter:*
>>
>> import pyspark
>> import pyspark.sql.functions
>>
>> from pyspark.sql import SparkSession
>> spark = (
>>     SparkSession
>>     .builder
>>     .appName("programming")
>>     .master("local")
>>     .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>>     .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
>>     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>>     .config('spark.ui.port', '4050')
>>     .getOrCreate()
>> )
>> from delta import *
>>
>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>> 2021-06-09,1001,Y,7
>> 2021-06-09,1002,Y,3.75
>> 2021-06-09,1003,Y,7.5
>> 2021-06-09,1004,Y,6.25'''
>>
>> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>>
>> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
>> OUTPUT_DELTA_PATH = './output/delta/'
>>
>> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>
>> spark.sql('''
>> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>     worked_date date
>>     , worker_id int
>>     , delete_flag string
>>     , hours_worked double
>> ) USING DELTA
>> PARTITIONED BY (worked_date)
>> LOCATION "{0}"
>> '''.format(OUTPUT_DELTA_PATH)
>> )
>>
>> *Error Message:*
>>
>> AnalysisException Traceback (most recent call last)
>> in
>>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>       5
>> ----> 6 spark.sql('''
>>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>       8     worked_date date
>>
>> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py
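For reference, the ParseException earlier in this thread is the parser complaining that the statement ends immediately after `CREATE OR REPLACE TABLE` — it still expects a table name, a column list, and so on. A minimal sketch (not from the thread itself) of a complete statement, reusing the table, column, and path names from the original post; everything else is illustrative:

```python
# Sketch: build the full DDL that the truncated "CREATE OR REPLACE TABLE"
# statement was missing. Table/column names and the output path are taken
# from the original post.
OUTPUT_DELTA_PATH = './output/delta/'

ddl = '''
CREATE OR REPLACE TABLE EXERCISE.WORKED_HOURS(
    worked_date date
    , worker_id int
    , delete_flag string
    , hours_worked double
) USING DELTA
PARTITIONED BY (worked_date)
LOCATION "{0}"
'''.format(OUTPUT_DELTA_PATH)

# In the original Jupyter session this string would then be executed with:
# spark.sql(ddl)
print(ddl)
```

Note that unlike `CREATE TABLE IF NOT EXISTS`, `CREATE OR REPLACE TABLE` replaces the table definition on each run.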
Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty
Pretty much what it says? You are creating a table over a path that already has data in it. You can't do that without mode=overwrite at least, if that's what you intend.

On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga wrote:
> - Component: Spark Delta, Spark SQL
> - Level: Beginner
> - Scenario: Debug, How-to
>
> *Python in Jupyter:*
>
> import pyspark
> import pyspark.sql.functions
>
> from pyspark.sql import SparkSession
> spark = (
>     SparkSession
>     .builder
>     .appName("programming")
>     .master("local")
>     .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>     .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
>     .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>     .config('spark.ui.port', '4050')
>     .getOrCreate()
> )
> from delta import *
>
> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
> 2021-06-09,1001,Y,7
> 2021-06-09,1002,Y,3.75
> 2021-06-09,1003,Y,7.5
> 2021-06-09,1004,Y,6.25'''
>
> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>
> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
> OUTPUT_DELTA_PATH = './output/delta/'
>
> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>
> spark.sql('''
> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>     worked_date date
>     , worker_id int
>     , delete_flag string
>     , hours_worked double
> ) USING DELTA
> PARTITIONED BY (worked_date)
> LOCATION "{0}"
> '''.format(OUTPUT_DELTA_PATH)
> )
>
> *Error Message:*
>
> AnalysisException Traceback (most recent call last)
> in
>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>       5
> ----> 6 spark.sql('''
>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>       8     worked_date date
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py in sql(self, sqlQuery)
>     647         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
>     648         """
> --> 649         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>     650
>     651     @since(2.0)
>
> \Users\kyjan\spark-3.0.3-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>     132                 # Hide where the exception came from that shows a non-Pythonic
>     133                 # JVM exception message.
> --> 134                 raise_from(converted)
>     135             else:
>     136                 raise
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in raise_from(e)
>
> AnalysisException: Cannot create table ('`EXERCISE`.`WORKED_HOURS`'). The associated location ('output/delta') is not empty.;
>
> --
> Best Wishes,
> Kumba Janga
>
> "The only way of finding the limits of the possible is by going beyond them into the impossible"
> -Arthur C. Clarke
[pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty
- Component: Spark Delta, Spark SQL
- Level: Beginner
- Scenario: Debug, How-to

*Python in Jupyter:*

import pyspark
import pyspark.sql.functions

from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("programming")
    .master("local")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config('spark.ui.port', '4050')
    .getOrCreate()
)
from delta import *

string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
2021-06-09,1001,Y,7
2021-06-09,1002,Y,3.75
2021-06-09,1003,Y,7.5
2021-06-09,1004,Y,6.25'''

rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))

# FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
OUTPUT_DELTA_PATH = './output/delta/'

spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')

spark.sql('''
CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
    worked_date date
    , worker_id int
    , delete_flag string
    , hours_worked double
) USING DELTA
PARTITIONED BY (worked_date)
LOCATION "{0}"
'''.format(OUTPUT_DELTA_PATH)
)

*Error Message:*

AnalysisException Traceback (most recent call last)
in
      4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
      5
----> 6 spark.sql('''
      7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
      8     worked_date date

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py in sql(self, sqlQuery)
    647         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    648         """
--> 649         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    650
    651     @since(2.0)

\Users\kyjan\spark-3.0.3-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
    132                 # Hide where the exception came from that shows a non-Pythonic
    133                 # JVM exception message.
--> 134                 raise_from(converted)
    135             else:
    136                 raise

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in raise_from(e)

AnalysisException: Cannot create table ('`EXERCISE`.`WORKED_HOURS`'). The associated location ('output/delta') is not empty.;

--
Best Wishes,
Kumba Janga

"The only way of finding the limits of the possible is by going beyond them into the impossible"
-Arthur C. Clarke
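As Sean's reply in this thread notes, the AnalysisException means the target LOCATION already contains files from a previous run. For a throwaway local exercise like this one (my suggestion, not something proposed in the thread), one way past the error is simply to clear the directory before re-running the CREATE TABLE:

```python
import os
import shutil

OUTPUT_DELTA_PATH = './output/delta/'  # path from the original post

# Simulate leftovers from a previous run: a stale file under the table location.
os.makedirs(OUTPUT_DELTA_PATH, exist_ok=True)
open(os.path.join(OUTPUT_DELTA_PATH, 'stale_file'), 'w').close()

# Clear the location so CREATE TABLE no longer sees a non-empty directory.
# Destructive: this throws away any existing table data, so it is only
# sensible for local experimentation, never for real data.
if os.path.isdir(OUTPUT_DELTA_PATH):
    shutil.rmtree(OUTPUT_DELTA_PATH)

# Re-running spark.sql('CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS ...')
# against the now-empty location would then succeed.
```

The non-destructive alternative, per Sean, is to write with overwrite semantics intentionally rather than recreating the table over old files.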
Re: WARN: netlib.BLAS
Hm, I think the problem is either that you need to build the spark-ganglia-lgpl module in your Spark distro, or the pomOnly() part of your build. You need the code in your app. Yes, you need OpenBLAS too.

On Mon, Aug 1, 2022 at 7:36 AM 陈刚 wrote:
> Dear expert,
>
> I'm using spark-3.1.1 mllib, and I got this on CentOS 7.6:
>
> 22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
> 22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
>
> I used
>
> yum -y install openblas*
> yum -y install blas
>
> to install BLAS, and added
>
> // https://mvnrepository.com/artifact/com.github.fommil.netlib/all
> libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()
>
> in the sbt file.
>
> But I still get the WARN.
>
> Please help me!
>
> Best,
> Gang Chen
WARN: netlib.BLAS
Dear expert,

I'm using spark-3.1.1 mllib, and I got this on CentOS 7.6:

22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

I used

yum -y install openblas*
yum -y install blas

to install BLAS, and added

// https://mvnrepository.com/artifact/com.github.fommil.netlib/all
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()

in the sbt file.

But I still get the WARN.

Please help me!

Best,
Gang Chen
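A sketch of the build change Sean's reply points at (my reading, not verified against this setup): `pomOnly()` tells sbt to resolve only the POM and not any jars, so the netlib implementation classes never reach the application classpath. Depending on the concrete netlib-java modules instead should pull the actual code; the artifact coordinates below are from the netlib-java project and should be double-checked against Maven Central before use:

```scala
// build.sbt -- sketch, assuming an sbt project on Linux x86_64.
// Pull the netlib-java classes (and the JNI natives) themselves, rather than
// just the aggregator POM; OpenBLAS must still be installed on the OS.
libraryDependencies += "com.github.fommil.netlib" % "core" % "1.1.2"
libraryDependencies += "com.github.fommil.netlib" % "netlib-native_system-linux-x86_64" % "1.1" classifier "natives"
```

After the change, the two "Failed to load implementation" warnings should disappear if the native library is found at runtime.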
Re: Use case idea
* streaming handling is still useful in Spark, though Flink exists as an alternative
* RDDs are also useful for transforms, especially for non-structured data
* there are many SQL products on the market like Drill/Impala, but Spark is more powerful for distributed deployment as far as I know
* we never used Spark for AI training; we use Keras/PyTorch, which are pretty easy for developing a model

> Perhaps you should try other systems in the market first; that will give an unbiased view of Databricks and Spark being just an over-glamourised tool. The hope of extending Spark with a separate, easy-to-use query engine for deep learning and other AI systems is gone now with Ray; the Spark community now largely just defends the lack of support and direction in this matter, which is a joke.
Re: Use case idea
Hi,

my comments were for the purposes of SQL; also, most other technologies like Snowflake and Redshift, and using KSQL directly to other sinks, do this quite easily without massive engineering. In fact, Databricks is trying to play a catch-up game in this market by coming out with GUI-based ETL tools :)

Perhaps you should try other systems in the market first; that will give an unbiased view of Databricks and Spark being just an over-glamourised tool. The hope of extending Spark with a separate, easy-to-use query engine for deep learning and other AI systems is gone now with Ray; the Spark community now largely just defends the lack of support and direction in this matter, which is a joke.

Thanks and Regards,
Gourav Sengupta

On Mon, Aug 1, 2022 at 4:54 AM pengyh wrote:
> I don't think so. We were using Spark integrated with Kafka for streaming computing and realtime reports. That just works.
>
> > SPARK is now just an overhyped and overcomplicated ETL tool, nothing more; there is another distributed AI called Ray, which should be the next billion dollar company, instead of just building those features in SPARK natively using a different computation engine :)
Re: unsubscribe
You should be able to unsubscribe yourself by using the signature below.

To unsubscribe e-mail: user-unsubscr...@spark.apache.org
unsubscribe