spark.sql.shuffle.partitions=auto

2024-04-30 Thread second_co...@yahoo.com.INVALID
May I know whether spark.sql.shuffle.partitions=auto is only available on Databricks? What about vanilla Spark? When I set it, I get an error saying the value must be an int. Is there any open source library that automatically finds the best partition count and block size for a DataFrame?
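
For context on the question above: open-source Spark expects an integer for spark.sql.shuffle.partitions, and the closest built-in behaviour is Adaptive Query Execution (AQE), which can coalesce shuffle partitions at runtime. The sketch below only assembles the relevant settings as a plain dictionary. The helper name `aqe_shuffle_conf` and the default values (400 initial partitions, 64m advisory size) are illustrative assumptions, not recommendations from this thread.

```python
def aqe_shuffle_conf(initial_partitions=400, advisory_size="64m"):
    """Build Spark SQL settings that let AQE coalesce shuffle partitions.

    The helper name and the default values are assumptions for illustration.
    """
    if initial_partitions <= 0:
        raise ValueError("initial_partitions must be a positive integer")
    return {
        # Turn on Adaptive Query Execution and partition coalescing.
        "spark.sql.adaptive.enabled": "true",
        "spark.sql.adaptive.coalescePartitions.enabled": "true",
        # Starting partition count before AQE coalesces small partitions.
        "spark.sql.adaptive.coalescePartitions.initialPartitionNum": str(initial_partitions),
        # Target size AQE aims for when merging shuffle partitions.
        "spark.sql.adaptive.advisoryPartitionSizeInBytes": advisory_size,
    }


conf = aqe_shuffle_conf()
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair could then be passed to SparkSession.builder.config(key, value), or written to spark-defaults.conf, before the session starts.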

auto create event log directory if not exist

2024-04-15 Thread second_co...@yahoo.com.INVALID
The Spark history server is set to use s3a, as below: spark.eventLog.enabled true, spark.eventLog.dir s3a://bucket-test/test-directory-log. Is there any configuration option I can set in the Spark config so that, if the directory 'test-directory-log' does not exist, it is created automatically before starting the Spark history

randomsplit has issue?

2024-01-31 Thread second_co...@yahoo.com.INVALID
Based on this blog post, https://sergei-ivanov.medium.com/why-you-should-not-use-randomsplit-in-pyspark-to-split-data-into-train-and-test-58576d539a36 , I noticed a recommendation against using randomSplit for data splitting because of data sorting. Is the information provided in the blog accurate?

Re: conver panda image column to spark dataframe

2023-08-03 Thread second_co...@yahoo.com.INVALID
se, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: I have a pandas DataFrame with a column 'image' holding numpy.ndarray values. The shape is (500, 333, 3) per image. My pandas DataFrame has 10 rows, thus

Re: conver panda image column to spark dataframe

2023-07-31 Thread second_co...@yahoo.com.INVALID
? Because if that's the case, then you'd want to only use 3 layers of ArrayType when you define the schema. Best regards, Adrian On Thu, Jul 27, 2023, 11:04 second_co...@yahoo.com.INVALID wrote: I have a pandas DataFrame with a column 'image' holding numpy.ndarray values. The shape is (500, 333, 3) per image. My pandas

conver panda image column to spark dataframe

2023-07-27 Thread second_co...@yahoo.com.INVALID
I have a pandas DataFrame with a column 'image' holding numpy.ndarray values. The shape is (500, 333, 3) per image. My pandas DataFrame has 10 rows, so the overall shape is (10, 500, 333, 3). When using spark.createDataFrame(panda_dataframe, schema), I need to specify the schema: schema = StructType([
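
The reply quoted in this thread suggests three layers of ArrayType for a (500, 333, 3) image, one layer per array dimension. As a rough sketch of that idea, the snippet below builds the equivalent DDL schema string, which spark.createDataFrame also accepts in place of a StructType. The helper name `nested_array_ddl` is hypothetical, and the `int` element type is an assumption (the thread does not state the pixel dtype).

```python
def nested_array_ddl(depth, element_type="int"):
    """Return a Spark DDL type with `depth` nested array layers.

    The helper name and the default element type are assumptions.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    ddl = element_type
    for _ in range(depth):
        # Wrap the current type in one more array layer.
        ddl = f"array<{ddl}>"
    return ddl


image_shape = (500, 333, 3)  # per-image shape from the thread
# One array layer per image dimension.
schema_ddl = f"image {nested_array_ddl(len(image_shape))}"
print(schema_ddl)  # image array<array<array<int>>>
```

Before calling createDataFrame, each numpy.ndarray would typically need converting to nested Python lists (for example with its tolist() method) so that the values match the array type.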

spark context list_packages()

2023-07-26 Thread second_co...@yahoo.com.INVALID
I ran the following code, spark.sparkContext.list_packages(), on Spark 3.4.1 and I get the error below: An error was encountered: AttributeError [Traceback (most recent call last): , File "/tmp/spark-3d66c08a-08a3-4d4e-9fdf-45853f65e03d/shell_wrapper.py", line 113, in exec

cannot load model using pyspark

2023-05-23 Thread second_co...@yahoo.com.INVALID
spark.sparkContext.textFile("s3a://a_bucket/models/random_forest_zepp/bestModel/metadata", 1).getNumPartitions() When I run the above code, I get the error below. Can you advise how to troubleshoot? I'm using Spark 3.3.0. The above file path exists.

Re: Tensorflow on Spark CPU

2023-04-30 Thread second_co...@yahoo.com.INVALID
second_co...@yahoo.com.INVALID wrote: Has anyone successfully run native TensorFlow on Spark? I tested the example at https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor on Kubernetes CPU, running it on multiple workers' CPUs. I do not see any speed up in training

Tensorflow on Spark CPU

2023-04-29 Thread second_co...@yahoo.com.INVALID
Has anyone successfully run native TensorFlow on Spark? I tested the example at https://github.com/tensorflow/ecosystem/tree/master/spark/spark-tensorflow-distributor on Kubernetes CPU, running it on multiple workers' CPUs. I do not see any speed up in training time by setting number of slots from 1

driver and executors shared same Kubernetes PVC

2023-04-28 Thread second_co...@yahoo.com.INVALID
I was able to share the same PVC on Spark 3.3, but from Spark 3.4 onward I get the error below. I would like all the executors and the driver to mount the same PVC. Is this a bug? I don't want to use SPARK_EXECUTOR_ID or OnDemandOn because otherwise each of the executors will use a unique and separate PVC.

read a binary file and save in another location

2023-03-09 Thread second_co...@yahoo.com.INVALID
Is there any example of how to read a binary file using PySpark and save it in another location (a copy feature)? Thank you, Teoh

pyspark.sql.dataframe.DataFrame versus pyspark.pandas.frame.DataFrame

2023-01-12 Thread second_co...@yahoo.com.INVALID
Good day, May I know what the difference is between pyspark.sql.dataframe.DataFrame and pyspark.pandas.frame.DataFrame? Are both stored in Spark DataFrame format? I'm looking for a way to load a huge Excel file (4-10GB), and I wonder whether I should use the third-party library spark-excel or just use

cannot write spark log to s3a

2022-11-09 Thread second_co...@yahoo.com.INVALID
When running a Spark job, I used "spark.eventLog.dir": "s3a://_some_bucket_on_prem/spark-history", "spark.eventLog.enabled": true. I see that the log of the job shows 22/11/10 06:42:30 INFO SingleEventLogFileWriter: Logging events to

Re: pyspark connect to spark thrift server port

2022-10-20 Thread second_co...@yahoo.com.INVALID
+Metastore+3.0+Administration). On 10/20/22 4:31 AM, second_co...@yahoo.com.INVALID wrote: Currently my PySpark code is able to connect to the Hive metastore at port 9083. However, with this approach I can't put in place any security mechanism like LDAP and SQL authentication control. Is there any way

pyspark connect to spark thrift server port

2022-10-20 Thread second_co...@yahoo.com.INVALID
Currently my PySpark code is able to connect to the Hive metastore at port 9083. However, with this approach I can't put in place any security mechanism like LDAP and SQL authentication control. Is there any way to connect from PySpark to the Spark Thrift server on port 1 without exposing hive