log transferring into hadoop/spark

2022-08-01 Thread pengyh

Since Flume is no longer actively developed, what is the current open-source
tool for transferring web server logs into HDFS/Spark?

thank you.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Kumba Janga
Thanks Sean! That was a simple fix. I changed it to "Create or Replace
Table" but now I am getting the following error. I am still researching
solutions but so far no luck.

ParseException:
mismatched input '<EOF>' expecting {'ADD', 'AFTER', 'ALL', 'ALTER',
'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC',
'AT', 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY',
'CACHE', 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR',
'CLUSTER', 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN',
'COLUMNS', 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE',
'CONCATENATE', 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE',
'CURRENT', 'CURRENT_DATE', 'CURRENT_TIME', 'CURRENT_TIMESTAMP',
'CURRENT_USER', 'DATA', 'DATABASE', DATABASES, 'DBPROPERTIES',
'DEFINED', 'DELETE', 'DELIMITED', 'DESC', 'DESCRIBE', 'DFS',
'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', 'DIV', 'DROP',
'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS',
'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE',
'FETCH', 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING',
'FOR', 'FOREIGN', 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION',
'FUNCTIONS', 'GLOBAL', 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'IF',
'IGNORE', 'IMPORT', 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH',
'INPUTFORMAT', 'INSERT', 'INTERSECT', 'INTERVAL', 'INTO', 'IS',
'ITEMS', 'JOIN', 'KEYS', 'LAST', 'LATERAL', 'LAZY', 'LEADING', 'LEFT',
'LIKE', 'LIMIT', 'LINES', 'LIST', 'LOAD', 'LOCAL', 'LOCATION', 'LOCK',
'LOCKS', 'LOGICAL', 'MACRO', 'MAP', 'MATCHED', 'MERGE', 'MSCK',
'NAMESPACE', 'NAMESPACES', 'NATURAL', 'NO', NOT, 'NULL', 'NULLS',
'OF', 'ON', 'ONLY', 'OPTION', 'OPTIONS', 'OR', 'ORDER', 'OUT',
'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', 'OVERLAY', 'OVERWRITE',
'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', 'PIVOT',
'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS',
'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER',
'RECORDWRITER', 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH',
'RENAME', 'REPAIR', 'REPLACE', 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT',
RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA',
'SELECT', 'SEMI', 'SEPARATED', 'SERDE', 'SERDEPROPERTIES',
'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', 'SKEWED', 'SOME',
'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', 'STRATIFY',
'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', 'TABLESAMPLE',
'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', 'TOUCH',
'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM',
'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE',
'UNION', 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE',
'USER', 'USING', 'VALUES', 'VIEW', 'VIEWS', 'WHEN', 'WHERE', 'WINDOW',
'WITH', IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 23)

== SQL ==
CREATE OR REPLACE TABLE
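
The parser stops at position 23 because nothing follows the table keywords: a
complete statement still needs the table name, column list, and the
format/location clauses. A minimal sketch, reusing the names from the original
post and assuming the installed Delta release accepts CREATE OR REPLACE TABLE:

# sketch only: EXERCISE.WORKED_HOURS and OUTPUT_DELTA_PATH come from the
# original post; actual behaviour depends on the Spark/Delta versions in use
spark.sql("""
CREATE OR REPLACE TABLE EXERCISE.WORKED_HOURS (
    worked_date date,
    worker_id int,
    delete_flag string,
    hours_worked double
) USING DELTA
PARTITIONED BY (worked_date)
LOCATION '{0}'
""".format(OUTPUT_DELTA_PATH))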


On Mon, Aug 1, 2022 at 8:32 PM Sean Owen  wrote:

> Pretty much what it says? You are creating a table over a path that
> already has data in it. You can't do that without mode=overwrite at least,
> if that's what you intend.
>
> On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:
>
>>
>>
>>- Component: Spark Delta, Spark SQL
>>- Level: Beginner
>>- Scenario: Debug, How-to
>>
>> *Python in Jupyter:*
>>
>> import pyspark
>> import pyspark.sql.functions
>>
>> from pyspark.sql import SparkSession
>> spark = (
>> SparkSession
>> .builder
>> .appName("programming")
>> .master("local")
>> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
>> .config("spark.sql.extensions", 
>> "io.delta.sql.DeltaSparkSessionExtension")
>> .config("spark.sql.catalog.spark_catalog", 
>> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
>> .config('spark.ui.port', '4050')
>> .getOrCreate()
>>
>> )
>> from delta import *
>>
>> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
>> 2021-06-09,1001,Y,7
>> 2021-06-09,1002,Y,3.75
>> 2021-06-09,1003,Y,7.5
>> 2021-06-09,1004,Y,6.25'''
>>
>> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>>
>> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
>> OUTPUT_DELTA_PATH = './output/delta/'
>>
>> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>
>> spark.sql('''
>> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>> worked_date date
>> , worker_id int
>> , delete_flag string
>> , hours_worked double
>> ) USING DELTA
>>
>>
>> PARTITIONED BY (worked_date)
>> LOCATION "{0}"
>> '''.format(OUTPUT_DELTA_PATH)
>> )
>>
>> *Error Message:*
>>
>> AnalysisException                        Traceback (most recent call last)
>> <ipython-input-…> in <module>
>>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>>       5
>> ----> 6 spark.sql('''
>>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>>       8     worked_date date
>> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py

Re: [pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Sean Owen
Pretty much what it says? You are creating a table over a path that already
has data in it. You can't do that without mode=overwrite at least, if
that's what you intend.
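
A rough illustration of the mode=overwrite route, using the DataFrame writer
rather than SQL DDL; here df is a hypothetical DataFrame holding the sample
rows from the post below, and OUTPUT_DELTA_PATH is the path defined there:

# hedged sketch, not a verified fix: replace whatever already sits at the
# target path; df is assumed to be a DataFrame built from the sample data
(df.write
    .format("delta")
    .mode("overwrite")
    .partitionBy("worked_date")
    .save(OUTPUT_DELTA_PATH))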

On Mon, Aug 1, 2022 at 7:29 PM Kumba Janga  wrote:

>
>
>- Component: Spark Delta, Spark SQL
>- Level: Beginner
>- Scenario: Debug, How-to
>
> *Python in Jupyter:*
>
> import pyspark
> import pyspark.sql.functions
>
> from pyspark.sql import SparkSession
> spark = (
> SparkSession
> .builder
> .appName("programming")
> .master("local")
> .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
> .config("spark.sql.extensions", 
> "io.delta.sql.DeltaSparkSessionExtension")
> .config("spark.sql.catalog.spark_catalog", 
> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
> .config('spark.ui.port', '4050')
> .getOrCreate()
>
> )
> from delta import *
>
> string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
> 2021-06-09,1001,Y,7
> 2021-06-09,1002,Y,3.75
> 2021-06-09,1003,Y,7.5
> 2021-06-09,1004,Y,6.25'''
>
> rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))
>
> # FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
> OUTPUT_DELTA_PATH = './output/delta/'
>
> spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>
> spark.sql('''
> CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
> worked_date date
> , worker_id int
> , delete_flag string
> , hours_worked double
> ) USING DELTA
>
>
> PARTITIONED BY (worked_date)
> LOCATION "{0}"
> '''.format(OUTPUT_DELTA_PATH)
> )
>
> *Error Message:*
>
> AnalysisException                        Traceback (most recent call last)
> <ipython-input-…> in <module>
>       4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
>       5
> ----> 6 spark.sql('''
>       7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
>       8     worked_date date
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py in sql(self, sqlQuery)
>     647         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
>     648         """
> --> 649         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>     650
>     651     @since(2.0)
>
> \Users\kyjan\spark-3.0.3-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
>    1302
>    1303         answer = self.gateway_client.send_command(command)
> -> 1304         return_value = get_return_value(
>    1305             answer, self.gateway_client, self.target_id, self.name)
>    1306
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
>     132                 # Hide where the exception came from that shows a non-Pythonic
>     133                 # JVM exception message.
> --> 134                 raise_from(converted)
>     135             else:
>     136                 raise
>
> /Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in raise_from(e)
>
> AnalysisException: Cannot create table ('`EXERCISE`.`WORKED_HOURS`'). The associated location ('output/delta') is not empty.;
>
>
> --
> Best Wishes,
> Kumba Janga
>
> "The only way of finding the limits of the possible is by going beyond
> them into the impossible"
> -Arthur C. Clarke
>


[pyspark delta] [delta][Spark SQL]: Getting an Analysis Exception. The associated location (path) is not empty

2022-08-01 Thread Kumba Janga
   - Component: Spark Delta, Spark SQL
   - Level: Beginner
   - Scenario: Debug, How-to

*Python in Jupyter:*

import pyspark
import pyspark.sql.functions

from pyspark.sql import SparkSession
spark = (
SparkSession
.builder
.appName("programming")
.master("local")
.config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0")
.config("spark.sql.extensions",
"io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog",
"org.apache.spark.sql.delta.catalog.DeltaCatalog")
.config('spark.ui.port', '4050')
.getOrCreate()

)
from delta import *

string_20210609 = '''worked_date,worker_id,delete_flag,hours_worked
2021-06-09,1001,Y,7
2021-06-09,1002,Y,3.75
2021-06-09,1003,Y,7.5
2021-06-09,1004,Y,6.25'''

rdd_20210609 = spark.sparkContext.parallelize(string_20210609.split('\n'))

# FILES WILL SHOW UP ON THE LEFT UNDER THE FOLDER ICON IF YOU WANT TO BROWSE THEM
OUTPUT_DELTA_PATH = './output/delta/'

spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')

spark.sql('''
CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
worked_date date
, worker_id int
, delete_flag string
, hours_worked double
) USING DELTA


PARTITIONED BY (worked_date)
LOCATION "{0}"
'''.format(OUTPUT_DELTA_PATH)
)
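
As an aside, the code above parallelizes the CSV text but never reads it back;
once the table issue is resolved, one possible way to turn it into a DataFrame
(a sketch only, with column types following the DDL above) would be:

# hedged sketch, not part of the original post: parse the sample CSV rows
# parallelized above and cast worked_date to match the partition column type
from pyspark.sql import functions as F

df_20210609 = (spark.read
    .csv(rdd_20210609, header=True, inferSchema=True)
    .withColumn("worked_date", F.to_date("worked_date")))

df_20210609.show()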

*Error Message:*

AnalysisException                        Traceback (most recent call last)
<ipython-input-…> in <module>
      4 spark.sql('CREATE DATABASE IF NOT EXISTS EXERCISE')
      5
----> 6 spark.sql('''
      7 CREATE TABLE IF NOT EXISTS EXERCISE.WORKED_HOURS(
      8     worked_date date

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\session.py in sql(self, sqlQuery)
    647         [Row(f1=1, f2=u'row1'), Row(f1=2, f2=u'row2'), Row(f1=3, f2=u'row3')]
    648         """
--> 649         return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
    650
    651     @since(2.0)

\Users\kyjan\spark-3.0.3-bin-hadoop2.7\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in deco(*a, **kw)
    132                 # Hide where the exception came from that shows a non-Pythonic
    133                 # JVM exception message.
--> 134                 raise_from(converted)
    135             else:
    136                 raise

/Users/kyjan/spark-3.0.3-bin-hadoop2.7\python\pyspark\sql\utils.py in raise_from(e)

AnalysisException: Cannot create table ('`EXERCISE`.`WORKED_HOURS`'). The associated location ('output/delta') is not empty.;
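
The leftover files under output/delta are most likely from an earlier run of
the same notebook; a blunt way back to a clean slate, assuming nothing else
lives under that path, is simply to remove the directory before re-running:

# hedged sketch: clear the old Delta output so the location is empty again;
# only safe if nothing else is stored under OUTPUT_DELTA_PATH
import shutil
shutil.rmtree(OUTPUT_DELTA_PATH, ignore_errors=True)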


-- 
Best Wishes,
Kumba Janga

"The only way of finding the limits of the possible is by going beyond them
into the impossible"
-Arthur C. Clarke


Re: WARN: netlib.BLAS

2022-08-01 Thread Sean Owen
Hm, I think the problem is either that you need to build the
spark-ganglia-lgpl module into your Spark distro, or it's the pomOnly() part
of your build: you need the code in your app.
Yes, you need OpenBLAS too.

On Mon, Aug 1, 2022 at 7:36 AM 陈刚  wrote:

> Dear expert,
>
>
> I'm using spark-3.1.1 mllib, and I got this on CentOS 7.6:
>
>
> 22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeSystemBLAS
> 22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from:
> com.github.fommil.netlib.NativeRefBLAS
>
>
> I used
>
> yum -y install openblas*
>
> yum -y install blas
>
> to install blas
>
> and added
>
>
> // https://mvnrepository.com/artifact/com.github.fommil.netlib/all
> libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2"
> pomOnly()
>
> in the sbt file.
>
>
> But I still got the WARN.
>
>
> Please help me!
>
>
> Best
>
>
> Gang Chen
>


WARN: netlib.BLAS

2022-08-01 Thread 陈刚
Dear expert,

I'm using spark-3.1.1 mllib, and I got this on CentOS 7.6:

22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
22/08/01 09:42:34 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

I used

yum -y install openblas*
yum -y install blas

to install blas, and added

// https://mvnrepository.com/artifact/com.github.fommil.netlib/all
libraryDependencies += "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly()

in the sbt file.

But I still got the WARN.

Please help me!

Best,

Gang Chen

Re: Use case idea

2022-08-01 Thread pengyh



* Streaming handling is still useful in Spark, though Flink exists as an
alternative.
* RDDs are also useful for transformations, especially for non-structured data.
* There are many SQL products on the market, like Drill/Impala, but Spark is
more powerful for distributed deployment as far as I know.
* We never used Spark for AI training; we use Keras/PyTorch, which are
pretty easy for developing a model.


> Perhaps you should try other systems in the market first; that will give an
> unbiased view of Databricks and SPARK being just an over-glamourised tool.
> The hope of extending SPARK with a separate, easy-to-use query engine for
> deep learning and other AI systems is gone now with Ray; the SPARK community
> now largely just defends the lack of support and direction in this matter,
> which is a joke.



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Use case idea

2022-08-01 Thread Gourav Sengupta
Hi,

my comments were for the purposes of SQL; also, most other technologies like
Snowflake and Redshift, and using KSQL directly to other sinks, work quite
easily without massive engineering. In fact, Databricks is trying to play a
catch-up game in this market by coming out with GUI-based ETL tools :)

Perhaps you should try other systems in the market first; that will give an
unbiased view of Databricks and SPARK being just an over-glamourised tool. The
hope of extending SPARK with a separate, easy-to-use query engine for deep
learning and other AI systems is gone now with Ray; the SPARK community now
largely just defends the lack of support and direction in this matter, which
is a joke.

Thanks and Regards,
Gourav Sengupta

On Mon, Aug 1, 2022 at 4:54 AM pengyh  wrote:

>
> I don't think so. We were using Spark integrated with Kafka for
> streaming computing and realtime reports. That just works.
>
>
> > SPARK is now just an overhyped and overcomplicated ETL tool, nothing
> > more; there is another distributed AI system called Ray, which should be
> > the next billion-dollar company, instead of just building those features
> > natively in SPARK using a different computation engine :)
>


Re: unsubscribe

2022-08-01 Thread pengyh

You can unsubscribe yourself by using the address in the signature below.



To unsubscribe e-mail: user-unsubscr...@spark.apache.org


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



unsubscribe

2022-08-01 Thread Martin Soch




-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org