Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Not sure I follow. If my output is my/path/output then the spark metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output so there's no way to split it? On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" <

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Yeah, but can’t you use the following? 1. For data files: my/path/part- 2. For partitioned data: my/path/partition= Best regards On 13 Apr 2023, at 12:58, Yuval Itzchakov wrote: The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4 "You might specify an S3 Lifecycle configuration in

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue. Best regards > On 13 Apr 2023, at 11:52, Yuval Itzchakov wrote: > >  > Hi everyone, > > I am using Spark's FileStreamSink in order to write files to S3. On the S3 > bucket, I

_spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Hi everyone, I am using Spark's FileStreamSink in order to write files to S3. On the S3 bucket, I have a lifecycle policy that deletes data older than X days from the bucket so that it does not grow infinitely. My problem starts with Spark jobs that don't have frequent data. What will
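The conflict described in this thread can be made concrete with a lifecycle configuration sketch (prefixes are illustrative, not from the original mails). Per the AWS documentation cited above, when two rules' prefix filters overlap, the shorter expiration takes precedence — which is why a longer-retention rule scoped to _spark_metadata alone would not protect it:

```json
{
  "Rules": [
    {
      "ID": "expire-old-output",
      "Filter": { "Prefix": "my/path/output/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "keep-spark-metadata",
      "Filter": { "Prefix": "my/path/output/_spark_metadata/" },
      "Status": "Enabled",
      "Expiration": { "Days": 365 }
    }
  ]
}
```

Because _spark_metadata lives under the same output prefix as the data files, the 30-day rule also matches it, and S3 resolves the overlap in favour of the earlier expiration.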

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Maytas Monsereenusorn
Hi, I was wondering if it's not possible to determine tasks to functions, is it still possible to easily figure out which job and stage completed which part of the query from the UI? For example, in the SQL tab of the Spark UI, I am able to see the query and the Job IDs for that query. However,

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the docker container's folders. It will work if you are using s3 or gs. On Wed, 12 Apr 2023, 18:02, Mich Talebzadeh wrote: > Hi, > > In my spark-submit to eks cluster, I use the standard code to submit to > the cluster as below: > > spark-submit --verbose \ >--master

Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Hi, In my spark-submit to eks cluster, I use the standard code to submit to the cluster as below: spark-submit --verbose \ --master k8s://$KUBERNETES_MASTER_IP:443 \ --deploy-mode cluster \ --name sparkOnEks \ --py-files local://$CODE_DIRECTORY/spark_on_eks.zip \

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
Hi Lingzhe Sun, Thanks for your comments. I am afraid I won't be able to take part in this project and contribute. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
Hi, tl;dr it's not possible to "reverse-engineer" tasks to functions. In essence, Spark SQL is an abstraction layer over RDD API that's made up of partitions and tasks. Tasks are Scala functions (possibly with some Python for PySpark). A simple-looking high-level operator like DataFrame.join can

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲
Hi Rajesh, It's working fine, at least for now. But you'll need to build your own spark image using later versions. Lingzhe Sun Hirain Technologies Original: From: Rajesh Katkar Date: 2023-04-12 21:36:52 To: Lingzhe Sun Cc: Mich Talebzadeh, user Subject: Re: Re: spark streaming and

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
unsubscribe On Wed, Apr 12, 2023 at 3:59 PM Rajesh Katkar wrote: > Hi Lingzhe, > > We have also started using this operator. > Do you see any issues with it? > > > On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > >> Hi Mich, >> >> FYI we're using spark operator( >>

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
Hi Lingzhe, We have also started using this operator. Do you see any issues with it? On Wed, 12 Apr, 2023, 7:25 am Lingzhe Sun, wrote: > Hi Mich, > > FYI we're using spark operator( > https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build > stateful structured streaming on k8s

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Pozdrawiam, Jacek Laskowski "The Internals Of" Online

PySpark tests are failed with the java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found

2023-04-12 Thread Ranga Reddy
Hi Team, I am running the PySpark tests in Spark and they failed with *Provider org.apache.spark.sql.sources.FakeSourceOne not found.* Spark Version: 3.4.0/3.5.0 Python Version: 3.8.10 OS: Ubuntu 20.04 *Steps:* # /opt/data/spark/build/sbt -Phive clean package #

Re: Non string type partitions

2023-04-12 Thread Charles vinodh
There are other distributed execution engines (like hive, trino) that do support non-string data types for partition columns such as date and integer. Any idea why this restriction exists in Spark? On Tue, 11 Apr 2023 at 20:34, Chitral Verma wrote: > Because the name of the directory

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
Hi Mich, FYI we're using spark operator(https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) to build stateful structured streaming on k8s for a year. Haven't test it using non-operator way. Besides that, the main contributor of the spark operator, Yinan Li, has been inactive for

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
try explain codegen on your DF and then parse the string On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu, wrote: > Hi, > > The detailed stage page shows the involved WholeStageCodegen Ids in its > DAG visualization from the Spark UI when running a SparkSQL. (e.g., under > the link >

Re: Non string type partitions

2023-04-11 Thread Chitral Verma
Because the name of the directory cannot be an object, it has to be a string to create partitioned dirs like "date=2023-04-10" On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, wrote: > > Hi Team, > > We are running into the below error when we are trying to run a simple > query on a partitioned table
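Chitral's point — that partition values end up as text because they become directory names — can be illustrated with a small sketch of the Hive-style layout Spark writes (this is illustrative, not Spark's actual implementation):

```python
from datetime import date

def partition_path(base: str, col: str, value) -> str:
    """Hive-style partition layout: whatever the column's logical type,
    the value is rendered as text, because filesystem directory names
    are strings."""
    return f"{base}/{col}={value}"

print(partition_path("s3://bucket/table", "date", date(2023, 4, 10)))
# s3://bucket/table/date=2023-04-10
```

The MetaException in the thread, however, comes from the Hive metastore's partition-filter pushdown, which is a separate restriction from the directory layout itself.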

Non string type partitions

2023-04-11 Thread Charles vinodh
Hi Team, We are running into the below error when we are trying to run a simple query on a partitioned table in Spark. *MetaException(message:Filtering is supported only on partition keys of type string)* Our partition column has been set to type *date* instead of string, and the query is a very

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark applications in the form of containers in an automated fashion so that one can easily deploy as many instances of the application as required (autoscaling). From below:

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this "In so far as I know k8s does not support spark structured streaming?" So it is an open question. I just recalled it. I have not tested myself. I know structured streaming works on Google Dataproc cluster but I have not seen any official link that says Spark Structured

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support spark streaming ? On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, wrote: > Do you have a high level diagram of the proposed solution? > > In so far as I know k8s does not support spark structured streaming? > > Mich

[ANNOUNCE] Apache Uniffle(Incubating) 0.7.0 available

2023-04-10 Thread Junfan Zhang
Hi all, Apache Uniffle (incubating) Team is glad to announce the new release of Apache Uniffle (incubating) 0.7.0. Apache Uniffle (incubating) is a high performance, general purpose Remote Shuffle Service for distributed compute engines like Apache Spark , Apache

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd
remove On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote: > OK Spark Structured Streaming. > > How are you getting messages into Spark? Is it Kafka? > > This to me indicates that the message is incomplete or has another value in > the JSON > > HTH > > Mich Talebzadeh, > Lead Solutions

[SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-07 Thread Chenghao Lyu
Hi, The detailed stage page shows the involved WholeStageCodegen Ids in its DAG visualization from the Spark UI when running a SparkSQL. (e.g., under the link node:18088/history/application_1663600377480_62091/stages/stage/?id=1=0). However, I have trouble extracting the WholeStageCodegen ids
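Besides scraping the UI, the Spark monitoring REST API exposes SQL execution details under /api/v1, which includes the plan node names (WholeStageCodegen among them). The endpoint path and the `nodes`/`nodeName` fields below are assumptions based on the Spark 3.x monitoring API — verify them against the docs for your version:

```python
import json
from urllib.request import urlopen

def sql_endpoint(host: str, app_id: str, execution_id: int) -> str:
    # Monitoring REST API path for one SQL execution (assumed Spark 3.x
    # layout); works against a live driver UI or a history server.
    return f"{host}/api/v1/applications/{app_id}/sql/{execution_id}?details=true"

def codegen_node_names(host: str, app_id: str, execution_id: int):
    """Fetch the execution's plan graph and keep WholeStageCodegen nodes."""
    with urlopen(sql_endpoint(host, app_id, execution_id)) as resp:
        payload = json.load(resp)
    return [n["nodeName"] for n in payload.get("nodes", [])
            if "WholeStageCodegen" in n.get("nodeName", "")]
```

For example, `codegen_node_names("http://node:18088/history/application_1663600377480_62091".rsplit("/history", 1)[0], "application_1663600377480_62091", 1)` would query the history server mentioned in the thread, assuming those routes are enabled.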

Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Use case is , we want to read/write to kinesis streams using k8s Officially I could not find the connector or reader for kinesis from spark like it has for kafka. Checking here if anyone used kinesis and spark streaming combination ? On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, wrote: > Hi

RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
unsubscribe Regards, Kurt Jonske Senior Director Alvarez & Marsal Direct: 212 328 8532 Mobile: 312 560 5040 Email: kjon...@alvarezandmarsal.com www.alvarezandmarsal.com From: Mich Talebzadeh Sent: Thursday, April 06, 2023 11:45 AM To: Rajesh Katkar Cc:

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin profile

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh, What is the use case for Kinesis here? I have not used it personally, Which use case it concerns https://aws.amazon.com/kinesis/ Can you use something else instead? HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view

spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
Hi Spark Team, We need to read/write the kinesis streams using spark streaming. We checked the official documentation - https://spark.apache.org/docs/latest/streaming-kinesis-integration.html It does not mention kinesis connector. Alternative is - https://github.com/qubole/kinesis-sql which is

Raise exception whilst casting instead of defaulting to null

2023-04-05 Thread Yeachan Park
Hi all, The default behaviour of Spark is to add a null value for casts that fail, unless ANSI SQL is enabled, SPARK-30292. Whilst I understand that this is a subset of ANSI compliant behaviour, I don't understand why this feature is so
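The two behaviours Yeachan contrasts (silent null vs. raising, toggled in Spark by `spark.sql.ansi.enabled` per SPARK-30292) can be modelled in plain Python — a sketch of the semantics, not Spark's implementation:

```python
def cast_to_int(value: str, ansi: bool = False):
    """Model of Spark's cast semantics: with ANSI mode off (the default),
    an unparseable value becomes NULL (None); with ANSI mode on, the
    cast raises instead of silently nulling."""
    try:
        return int(value)
    except ValueError:
        if ansi:
            raise
        return None

print(cast_to_int("abc"))          # None  (default: silent null)
# cast_to_int("abc", ansi=True)    # raises ValueError (ANSI mode)
```

In PySpark the equivalent switch would be roughly `spark.conf.set("spark.sql.ansi.enabled", "true")` before running the cast.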

Re: Potability of dockers built on different cloud platforms

2023-04-05 Thread Mich Talebzadeh
The whole idea of creating a docker container is to have a deployable, self-contained utility. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. The

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread Mich Talebzadeh
OK Spark Structured Streaming. How are you getting messages into Spark? Is it Kafka? This to me indicates that the message is incomplete or has another value in the JSON HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin

Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread me
Dear Apache Spark users, I have a long running Spark application that is encountering an ArrayIndexOutOfBoundsException once every two weeks. The exception does not disrupt the operation of my app, but I'm still concerned about it and would like to find a solution. Here's some

Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-05 Thread me
Dear Apache Spark users, I have a long running Spark application that is encountering an ArrayIndexOutOfBoundsException once every two weeks. The exception does not disrupt the operation of my app, but I'm still concerned about it and would like to find a solution. Here's some

Re: Potability of dockers built on different cloud platforms

2023-04-05 Thread Ken Peng
ashok34...@yahoo.com.INVALID wrote: Is it possible to use Spark docker built on GCP on AWS without rebuilding from new on AWS? I am using the spark image from bitnami for running on k8s. And yes, it's deployed by helm. -- https://kenpeng.pages.dev/

Potability of dockers built on different cloud platforms

2023-04-05 Thread ashok34...@yahoo.com.INVALID
Hello team Is it possible to use Spark docker built on GCP on AWS without rebuilding from new on AWS? Will that work please. AK

Re: Creating InMemory relations with data in ColumnarBatches

2023-04-04 Thread Bobby Evans
This is not going to work without changes to Spark. InMemoryTableScanExec supports columnar output, but not columnar input. You would have to write code to support that in Spark itself. The second part is that there are only a handful of operators that support columnar output. Really it is just

Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
That 3 months retention is just a soft setting. For low volume traffic, it can be negotiated to a year’s retention. Let me see what we can do about it. HTH On Tue, 4 Apr 2023 at 09:31, Bjørn Jørgensen wrote: > One of the things that I don't like about this slack solution is that > questions

Re: Slack for PySpark users

2023-04-04 Thread Bjørn Jørgensen
One of the things that I don't like about this slack solution is that questions and answers disappear after 90 days. Today's maillist solution is indexed by search engines and when one day you wonder about something, you can find solutions with the help of just searching the web. Another question

Re: Slack for PySpark users

2023-04-04 Thread Mich Talebzadeh
Hi Shani, I believe I am an admin so that is fine by me. Hi Dongjoon, With regard to summarising the discussion etc., no need. It is like flogging a dead horse; we have already discussed it enough. I don't see the point of it. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead

Re: Slack for PySpark users

2023-04-04 Thread shani . alishar
Hey Dongjoon, Denny and all, I’ve created the current slack. All users have the option to create channels for different topics. I don’t see a reason for creating a new one. If anyone wants to be an admin on the current slack channel, you are all welcome to send me a msg and I’ll grant permission. Have a

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Thank you, Denny. May I interpret your comment as a request to support multiple channels in ASF too? > because it would allow us to create multiple channels for different topics Any other reasons? Dongjoon. On Mon, Apr 3, 2023 at 5:31 PM Denny Lee wrote: > I do think creating a new Slack

Re: Slack for PySpark users

2023-04-03 Thread Denny Lee
I do think creating a new Slack channel would be helpful because it would allow us to create multiple channels for different topics - streaming, graph, ML, etc. We would need a volunteer core to maintain it so we can keep the spirit and letter of ASF / code of conduct. I’d be glad to volunteer

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Shall we summarize the discussion so far? To sum up, "ASF Slack" vs "3rd-party Slack" was the real background to initiate this thread instead of "Slack" vs "Mailing list"? If ASF Slack provides what you need, is it better than creating a new Slack channel? Or, is there another reason for us to

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I agree, whatever individual sentiments are. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
Just to be clear, if there is no strong volunteer to make the new community channel stay active, I'd probably be OK to not fork the channel. You can see a strong counter example from #spark channel in ASF. It is the place where there are only questions and promos but zero answers. I see volunteers

Re: Slack for PySpark users

2023-04-03 Thread Jungtaek Lim
The number of subscribers doesn't give any meaningful value. Please look into the number of mails being sent to the list. https://lists.apache.org/list.html?user@spark.apache.org The latest month there were more than 200 emails being sent was Feb 2022, more than a year ago. It was more than 1k in

Re: Looping through a series of telephone numbers

2023-04-03 Thread Gera Shegalov
+1 to using a UDF. E.g., TransmogrifAI uses libphonenumber https://github.com/google/libphonenumber that normalizes

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
Do you think there is a way to put it back to the official ASF-provided Slack channel? Dongjoon. On Mon, Apr 3, 2023 at 2:18 PM Mich Talebzadeh wrote: > > I for myself prefer to use the newly formed slack. > > sparkcommunitytalk.slack.com > > In summary, it may be a good idea to take a tour of

Re: Slack for PySpark users

2023-04-03 Thread Mich Talebzadeh
I for myself prefer to use the newly formed slack. sparkcommunitytalk.slack.com In summary, it may be a good idea to take a tour of it and see for yourself. Topics are sectioned as per user requests. I trust this answers your question. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead

Re: Slack for PySpark users

2023-04-03 Thread Dongjoon Hyun
As Mich Talebzadeh pointed out, Apache Spark has an official Slack channel. > It's unavoidable if "users" prefer to use an alternative communication mechanism rather than the user mailing list. The following is the number of people in the official channels. - user@spark.apache.org has 4519

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Philippe, Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Hi Mich, what exactly do you mean by « if you prefer to broadcast the reference data »? Philippe > Le 2 avr. 2023 à 18:16, Mich Talebzadeh a écrit : > > Hi Phillipe, > > These are my thoughts besides comments from Sean > > Just to clarify, you receive a CSV file periodically and you already

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Wow, you guys, Anastasios, Bjørn and Mich, are stars! Thank you very much for your suggestions. I’m going to print them and study them closely. > Le 2 avr. 2023 à 20:05, Anastasios Zouzias a écrit : > > Hi Philippe, > > I would like to draw your attention to this great library that saved my

Re: Looping through a series of telephone numbers

2023-04-02 Thread Anastasios Zouzias
Hi Philippe, I would like to draw your attention to this great library that saved my day in the past when parsing phone numbers in Spark: https://github.com/google/libphonenumber If you combine it with Bjørn's suggestions you will have a good start on your linkage task. Best regards,

Re: Looping through a series of telephone numbers

2023-04-02 Thread Bjørn Jørgensen
dataset.csv id,tel_in_dataset 1,+33 2,+331222 3,+331333 4,+331222 5,+331222 6,+331444 7,+331222 8,+331555 telephone_numbers.csv tel +331222 +331222 +331222 +331222 start spark with all of your cpu and ram import os import multiprocessing

Re: Looping through a series of telephone numbers

2023-04-02 Thread Mich Talebzadeh
Hi Phillipe, These are my thoughts besides comments from Sean Just to clarify, you receive a CSV file periodically and you already have a file that contains valid patterns for phone numbers (reference) In a pseudo language you can probe your csv DF against the reference DF // load your

Re: Looping through a series of telephone numbers

2023-04-02 Thread Sean Owen
That won't work, you can't use Spark within Spark like that. If it were exact matches, the best solution would be to load both datasets and join on telephone number. For this case, I think your best bet is a UDF that contains the telephone numbers as a list and decides whether a given number
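Sean's UDF suggestion can be sketched in plain Python — the core closure below is what a `pyspark.sql.functions.udf` would wrap; the numbers and the `make_matcher` name are illustrative, not from the thread:

```python
def make_matcher(tels):
    """Build the function a UDF would wrap: substring match of a column
    value against a fixed list of numbers, mirroring the .like('%tel%')
    semantics from the original question."""
    tels = list(tels)
    def matches(value):
        if value is None:          # NULL column values never match
            return False
        return any(t in value for t in tels)
    return matches

matches = make_matcher(["331222", "331444"])
print(matches("+331222"))   # True
print(matches("+331555"))   # False
# In PySpark this would be registered roughly as:
#   from pyspark.sql import functions as F, types as T
#   df.withColumn("hit", F.udf(matches, T.BooleanType())(F.col("tel")))
```

Because the list is captured in the closure, it ships with the task once rather than being re-read per row; for tens of thousands of numbers, a broadcast variable (as discussed elsewhere in the thread) serves the same purpose more explicitly.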

Re: Looping through a series of telephone numbers

2023-04-02 Thread Philippe de Rochambeau
Many thanks, Mich. Is « foreach » the best construct to look up items in a dataset such as the below « telephonedirectory » data set? val telrdd = spark.sparkContext.parallelize(Seq(« tel1 » , « tel2 » , « tel3 » …)) // the telephone sequence // read from a CSV file val ds =

Re: Looping through a series of telephone numbers

2023-04-01 Thread Mich Talebzadeh
This may help: Spark rlike() Working with Regex Matching Examples Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Looping through a series of telephone numbers

2023-04-01 Thread Philippe de Rochambeau
Hello, I’m looking for an efficient way in Spark to search for a series of telephone numbers, contained in a CSV file, in a data set column. In pseudo code, for tel in [tel1, tel2, …. tel40,000] search for tel in dataset using .like(« %tel% ») end for I’m using the like function

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Mich Talebzadeh
Good stuff Khalid. I have created a section in Apache Spark Community Stack called spark foundation. spark-foundation - Apache Spark Community - Slack I invite you to add your weblink to that section.

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-04-01 Thread Khalid Mammadov
Hey AN-TRUONG I have got some articles about this subject that should help. E.g. https://khalidmammadov.github.io/spark/spark_internals_rdd.html Also check other Spark Internals on web. Regards Khalid On Fri, 31 Mar 2023, 16:29 AN-TRUONG Tran Phan, wrote: > Thank you for your information, >

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
yes history refers to completed jobs. 4040 is the running jobs you should have screen shots for executors and stages as well. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread AN-TRUONG Tran Phan
Thank you for your information, I have tracked the spark history server on port 18080 and the spark UI on port 4040. I see the results of these two tools as similar, right? I want to know what each Task ID (Example Task ID 0, 1, 3, 4, 5, ) in the images does, is it possible?

Re: Help me learn about JOB TASK and DAG in Apache Spark

2023-03-31 Thread Mich Talebzadeh
Are you familiar with spark GUI default on port 4040? have a look. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Creating InMemory relations with data in ColumnarBatches

2023-03-31 Thread praveen sinha
Yes, purely for performance. On Thu, Mar 30, 2023, 3:01 PM Mich Talebzadeh wrote: > Is this purely for performance consideration? > > Mich Talebzadeh, > Lead Solutions Architect/Engineering Lead > Palantir Technologies Limited > > >view my Linkedin profile >

Re: Slack for PySpark users

2023-03-30 Thread Jungtaek Lim
I'm reading through the page "Briefing: The Apache Way", and in the section of "Open Communications", restriction of communication inside ASF INFRA (mailing list) is more about code and decision-making. https://www.apache.org/theapacheway/#what-makes-the-apache-way-so-hard-to-define It's

Re: Creating InMemory relations with data in ColumnarBatches

2023-03-30 Thread Mich Talebzadeh
Is this purely for performance consideration? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Good discussions and proposals, all around. I have used slack in anger on a customer site before. For small and medium size groups it is good and affordable. Alternatives have been suggested as well so those who like investigative search can agree and come up with a freebie one. I am inclined to

Re: Slack for PySpark users

2023-03-30 Thread Denny Lee
+1. To Shani’s point, there are multiple OSS projects that use the free Slack version - top of mind include Delta, Presto, Flink, Trino, Datahub, MLflow, etc. On Thu, Mar 30, 2023 at 14:15 wrote: > Hey everyone, > > I think we should remain on a free program in slack. > > In my opinion the free

Re: Slack for PySpark users

2023-03-30 Thread shani . alishar
Hey everyone, I think we should remain on a free program in slack. In my opinion the free program is more than enough; the only downside is we can only see the last 90 days' messages. From what I know the Airflow community (which has a strong active community in slack) also uses the free program (You

Re: Slack for PySpark users

2023-03-30 Thread Mridul Muralidharan
Thanks for flagging the concern Dongjoon, I was not aware of the discussion - but I can understand the concern. Would be great if you or Matei could update the thread on the result of deliberations, once it reaches a logical consensus: before we set up official policy around it. Regards, Mridul

unsubscribe

2023-03-30 Thread Daniel Tavares de Santana
unsubscribe

Re: Slack for PySpark users

2023-03-30 Thread Bjørn Jørgensen
I like the idea of having a talk channel. It can make it easier for everyone to say hello. Or to dare to ask about small or big matters that you would not have dared to ask about before on mailing lists. But then there is the price and what is the best for an open source project. The price for

Creating InMemory relations with data in ColumnarBatches

2023-03-30 Thread praveen sinha
Hi, I have been trying to implement InMemoryRelation based on spark ColumnarBatches, so far I have not been able to store the vectorised columnarbatch into the relation. Is there a way to achieve this without going with an intermediary representation like Arrow, so as to enable spark to do fast

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon to your points if I may - Do you have any reference from other official ASF-related Slack channels? No, I don't have any reference from other official ASF-related Slack channels because I don't think that matters. However, I stand corrected - To be clear, I intentionally didn't

Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
To Mich. - Do you have any reference from other official ASF-related Slack channels? - To be clear, I intentionally didn't refer to any specific mailing list because we didn't set up any rule here yet. To Xiao. I understand what you mean. That's the reason why I added Matei from your side. > I

Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
Hi, Dongjoon, The other communities (e.g., Pinot, Druid, Flink) created their own Slack workspaces last year. I did not see an objection from the ASF board. At the same time, Slack workspaces are very popular and useful in most non-ASF open source communities. TBH, we are kind of late. I think we

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
Hi Dongjoon, Thanks for your point. I gather you are referring to archive as below https://lists.apache.org/list.html?user@spark.apache.org Otherwise, correct me. Thanks Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Slack for PySpark users

2023-03-30 Thread Dongjoon Hyun
Hi, Xiao and all. (cc Matei) Please hold on the vote. There is a concern expressed by ASF board because recent Slack activities created an isolated silo outside of ASF mailing list archive. We need to establish a way to embrace it back to ASF archive before starting anything official. Bests,

[ANNOUNCE] Apache Celeborn(incubating) 0.2.1 available

2023-03-30 Thread rexxiong
Hi all, Apache Celeborn(Incubating) community is glad to announce the new release of Apache Celeborn(Incubating) 0.2.1 Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
The ownership of slack belongs to spark community Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use

Re: Slack for PySpark users

2023-03-30 Thread Mich Talebzadeh
We already have it general - Apache Spark Community - Slack Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Slack for PySpark users

2023-03-30 Thread shani . alishar
Hey there, I agree. If the Apache Spark PMC can maintain the spark community workspace, that would be great! Instead of creating a new one, they can also become the owner of the current one. Best regards, Shani On 30 Mar 2023, at 9:32, Xiao Li wrote: +1 + @d...@spark.apache.org This is a good idea. The

Re: Slack for PySpark users

2023-03-30 Thread Xiao Li
+1 + @d...@spark.apache.org This is a good idea. The other Apache projects (e.g., Pinot, Druid, Flink) have created their own dedicated Slack workspaces for faster communication. We can do the same in Apache Spark. The Slack workspace will be maintained by the Apache Spark PMC. I propose to

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
Hi Jacek, Thanks for the hints, I would rather have the information statically rather than build a logical plan. I'm using Apache Calcite to build SQL expressions and then I feed them to spark to run, so the pipeline goes like this: initial query in SQL (from the user) + schema definition (from

Re: spark.catalog.listFunctions type signatures

2023-03-28 Thread Jacek Laskowski
Hi, Interesting question indeed! The closest I could get would be to use lookupFunctionBuilder(name: FunctionIdentifier): Option[FunctionBuilder] [1] followed by extracting the dataType from T in `type FunctionBuilder = Seq[Expression] => T` which can be Expression (regular functions) or

spark.catalog.listFunctions type signatures

2023-03-28 Thread Guillaume Masse
Hi, I'm using Apache Calcite to run some SQL transformations on Apache Spark SQL statements. I would like to extract the type signature out of spark.catalog.listFunctions to be able to register them in Calcite with their proper signature. From the API, I can get the fully qualified class name

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
https://join.slack.com/t/sparkcommunitytalk/shared_invite/zt-1rk11diac-hzGbOEdBHgjXf02IZ1mvUA Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi Bjorn, you just need to create an account on slack and join any topic I believe HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Bjørn Jørgensen
Do I need to get an invite before joining? On Tue, 28 Mar 2023 at 18:51, Mich Talebzadeh < mich.talebza...@gmail.com> wrote: > Hi all, > > There is a section in slack called webinars > > > https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG > >

Re: Topics for Spark online classes & webinars

2023-03-28 Thread Mich Talebzadeh
Hi all, There is a section in slack called webinars https://sparkcommunitytalk.slack.com/x-p4977943407059-5006939220983-5006939446887/messages/C0501NBTNQG Asma Zgolli, agreed to prepare materials for Spark internals and/or comparing spark 3 and 2. I like to contribute to "Spark Streaming &

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread lee
That is, every pagerank value has no relationship to 1, right? As long as we focus on the size of each pagerank value in Graphx, we don't need to focus on the range, is that right? 李杰 leedd1...@163.com Replied Message From: Sean Owen Date: 3/28/2023 22:33 To:

Re: What is the range of the PageRank value of graphx

2023-03-28 Thread Sean Owen
From the docs: * Note that this is not the "normalized" PageRank and as a consequence pages that have no * inlinks will have a PageRank of alpha. In particular, the pageranks may have some values * greater than 1. On Tue, Mar 28, 2023 at 9:11 AM lee wrote: > When I calculate pagerank using
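If a distribution summing to 1 (as lee saw from HugeGraph) is wanted from GraphX's un-normalized scores, one option is to rescale by the total after the fact — a plain-Python sketch of the idea, with made-up example ranks:

```python
def normalize_ranks(ranks: dict) -> dict:
    """Rescale un-normalized PageRank scores so they sum to 1,
    preserving their relative ordering."""
    total = sum(ranks.values())
    return {vertex: rank / total for vertex, rank in ranks.items()}

# GraphX-style output: individual values may exceed 1.
ranks = {"a": 1.4, "b": 0.4, "c": 0.2}
print(normalize_ranks(ranks))
```

In GraphX itself the equivalent would be mapping the result vertices by the summed rank; the relative ordering — which is usually what matters — is unchanged either way.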

What is the range of the PageRank value of graphx

2023-03-28 Thread lee
When I calculate pagerank using HugeGraph, each pagerank value is less than 1, and the total of the pageranks is 1. However, the PageRank value of graphx is often greater than 1, so what is the range of the PageRank value of graphx? 李杰 leedd1...@163.com

Re: Slack for PySpark users

2023-03-28 Thread Mich Talebzadeh
I created one at slack called pyspark Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use it at your
