Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Gavin Ray
Wow, really neat -- thanks for sharing!

On Mon, Jul 3, 2023 at 8:12 PM Gengliang Wang  wrote:

> Dear Apache Spark community,
>
> We are delighted to announce the launch of a groundbreaking tool that aims
> to make Apache Spark more user-friendly and accessible - the English SDK.
> Powered by Generative AI, the English SDK allows you to execute complex
> tasks with simple English instructions. This exciting news was announced
> recently at the Data+AI Summit and also introduced through a detailed blog
> post.
>
> Now, we need your invaluable feedback and contributions. The aim of the
> English SDK is not only to simplify and enrich your Apache Spark experience
> but also to grow with the community. We're calling upon Spark developers
> and users to explore this innovative tool, offer your insights, provide
> feedback, and contribute to its evolution.
>
> You can find more details about the SDK and usage examples on the GitHub
> repository https://github.com/databrickslabs/pyspark-ai/. If you have any
> feedback or suggestions, please feel free to open an issue directly on the
> repository. We are actively monitoring the issues and value your insights.
>
> We also welcome pull requests and are eager to see how you might extend or
> refine this tool. Let's come together to continue making Apache Spark more
> approachable and user-friendly.
>
> Thank you in advance for your attention and involvement. We look forward
> to hearing your thoughts and seeing your contributions!
>
> Best,
> Gengliang Wang
>


Re: Complexity with the data

2022-05-25 Thread Gavin Ray
Forgot to reply-all on my last message, whoops. Not very good at email.

You need to normalize the CSV with a parser that can escape commas inside
strings.
I'm not sure whether Spark has an option for this?
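(Spark's CSV reader does have multiLine/quote options, but they only help when
the producing system wraps such fields in quotes. A minimal Scala sketch,
assuming quoted fields; "path" is a placeholder:)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")   // let quoted fields span multiple lines
  .option("quote", "\"")
  .option("escape", "\"")
  .option("inferSchema", "true")
  .csv("path")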


On Wed, May 25, 2022 at 4:37 PM Sid  wrote:

> Thank you so much for your time.
>
> I have data like the sample below, which I tried to load by setting multiple
> options while reading the file; however, I am not able to keep the 9th
> column's data together in a single field.
>
> [image: image.png]
>
> I tried the below code:
>
> df = spark.read.option("header", "true") \
>     .option("multiLine", "true") \
>     .option("inferSchema", "true") \
>     .option("quote", '"') \
>     .option("delimiter", ",") \
>     .csv("path")
>
> What else can I do?
>
> Thanks,
> Sid
>
>
> On Thu, May 26, 2022 at 1:46 AM Apostolos N. Papadopoulos <
> papad...@csd.auth.gr> wrote:
>
>> Dear Sid,
>>
>> can you please give us more info? Is it true that every line may have a
>> different number of columns? Is there any rule followed by every line of
>> the file? From the information you have sent, I cannot fully understand
>> the "schema" of your data.
>>
>> Regards,
>>
>> Apostolos
>>
>>
>> On 25/5/22 23:06, Sid wrote:
>> > Hi Experts,
>> >
>> > I have the below CSV data, which is generated automatically. I can't
>> > change the data manually.
>> >
>> > The data looks like below:
>> >
>> > 2020-12-12,abc,2000,,INR,
>> > 2020-12-09,cde,3000,he is a manager,DOLLARS,nothing
>> > 2020-12-09,fgh,,software_developer,I only manage the development part.
>> >
>> > Since I don't have much experience with the other domains.
>> >
>> > It is handled by the other people.,INR
>> > 2020-12-12,abc,2000,,USD,
>> >
>> > The third record is the problem: the user entered newlines in that value
>> > while filling out the form. So, how do I handle this?
>> >
>> > There are 6 columns and 4 records in total. These are sample records.
>> >
>> > Should I load it as an RDD and then maybe use a regex to eliminate the
>> > newlines? Or how should it be done? With ". /n"?
>> >
>> > Any suggestions?
>> >
>> > Thanks,
>> > Sid
>>
>> --
>> Apostolos N. Papadopoulos, Associate Professor
>> Department of Informatics
>> Aristotle University of Thessaloniki
>> Thessaloniki, GREECE
>> tel: ++0030312310991918
>> email: papad...@csd.auth.gr
>> twitter: @papadopoulos_ap
>> web: http://datalab.csd.auth.gr/~apostol
>>
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>


Re: [Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-18 Thread Gavin Ray
Following up on this in case anyone runs across it in the archives in the
future
From reading through the config docs and trying various combinations, I've
discovered that:

- You don't want to disable codegen. In basic testing, disabling it roughly
doubled the time to run simple, few-column/few-row queries.
  - You can test this by setting an internal property after setting
"spark.testing" to "true" in the system properties:


> import org.apache.spark.sql.SparkSession
>
> // Per the note above, the internal codegen factory-mode conf is only
> // honored when "spark.testing" is set before the session is built.
> System.setProperty("spark.testing", "true")
> val spark = SparkSession.builder()
>   .config("spark.sql.codegen.wholeStage", "false")
>   .config("spark.sql.codegen.factoryMode", "NO_CODEGEN")
>   .getOrCreate()

- The following gave the best performance. I don't know whether enabling the
CBO settings made much difference:

> val spark = SparkSession.builder()
>   .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .config("spark.kryo.unsafe", "true")
>   .config("spark.sql.adaptive.enabled", "true")
>   .config("spark.sql.cbo.enabled", "true")
>   .config("spark.sql.cbo.joinReorder.dp.star.filter", "true")
>   .config("spark.sql.cbo.joinReorder.enabled", "true")
>   .config("spark.sql.cbo.planStats.enabled", "true")
>   .config("spark.sql.cbo.starSchemaDetection", "true")
>   .getOrCreate()


If you're running on a more recent JDK, you'll need to set "--add-opens" flags
for a few module namespaces for "spark.kryo.unsafe" to work.
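
For example (the exact set of namespaces is an assumption and varies by
workload; passed via spark-submit's --driver-java-options and
spark.executor.extraJavaOptions):

  --add-opens=java.base/java.nio=ALL-UNNAMED
  --add-opens=java.base/sun.nio.ch=ALL-UNNAMED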



On Mon, May 16, 2022 at 12:55 PM Gavin Ray  wrote:

> Hi all,
>
> I've not got much experience with Spark, but have been reading the
> Catalyst and
> Datasources V2 code/tests to try to get a basic understanding.
>
> I'm interested in trying Catalyst's query planner + optimizer for queries
> spanning one-or-more JDBC sources.
>
> Somewhat unusually, I'd like to do this with as minimal latency as
> possible to
> see what the experience for standard line-of-business apps is like (~90/10
> read/write ratio).
> Few rows would be returned in the reads (something on the order of
> 1-to-1,000).
>
> My question is: What configuration settings would you want to use for
> something
> like this?
>
> I imagine that doing codegen/JIT compilation of the query plan might not be
> worth the cost, so maybe you'd want to disable that and do interpretation?
>
> And possibly you'd want to use query plan config/rules that reduce the time
> spent in planning, trading efficiency for latency?
>
> Does anyone know how you'd configure Spark to test something like this?
>
> Would greatly appreciate any input (even if it's "This is a bad idea and
> will
> never work well").
>
> Thank you =)
>


[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

2022-05-18 Thread Gavin Ray
I did some basic testing of multi-source queries with the most recent Spark:
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115

The output of "spark.time()" surprised me:

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1

+---+----+---+------+
| id|name| id| title|
+---+----+---+------+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---+----+---+------+
Time taken: 168 ms

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2
LIMIT 1

+---+-----+---+------+
| id| name| id| title|
+---+-----+---+------+
|  2|Alice|  3|Todo 3|
+---+-----+---+------+
Time taken: 228 ms
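
(These timings come from spark.time, as mentioned above; a sketch of what that
looks like, assuming db1/db2 are the JDBC catalogs registered in the linked
test code:)

spark.time {
  spark.sql(
    """SELECT p.id, p.name, t.id, t.title
      |FROM db1.public.person p
      |JOIN db2.public.todos t ON p.id = t.person_id
      |WHERE p.id = 2
      |LIMIT 1""".stripMargin
  ).show()
}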


Calcite and Teiid manage to do this on the order of 5-50 ms for basic queries,
so I'm curious about the technical specifics of why Spark appears to be so much
slower here.


[Spark SQL]: Configuring/Using Spark + Catalyst optimally for read-heavy transactional workloads in JDBC sources?

2022-05-16 Thread Gavin Ray
Hi all,

I've not got much experience with Spark, but have been reading the Catalyst
and
Datasources V2 code/tests to try to get a basic understanding.

I'm interested in trying Catalyst's query planner + optimizer for queries
spanning one-or-more JDBC sources.

Somewhat unusually, I'd like to do this with as minimal latency as possible
to
see what the experience for standard line-of-business apps is like (~90/10
read/write ratio).
Few rows would be returned in the reads (something on the order of
1-to-1,000).

My question is: What configuration settings would you want to use for
something
like this?

I imagine that doing codegen/JIT compilation of the query plan might not be
worth the cost, so maybe you'd want to disable that and do interpretation?

And possibly you'd want to use query plan config/rules that reduce the time
spent in planning, trading efficiency for latency?

Does anyone know how you'd configure Spark to test something like this?

Would greatly appreciate any input (even if it's "This is a bad idea and
will
never work well").

Thank you =)


unsubscribe

2022-05-02 Thread Ray Qiu



Re: No SparkR on Mesos?

2016-09-08 Thread ray
Hi, Rodrick,

Interesting. SparkR is not expected to work with Mesos due to lack of Mesos
support in some places, and it has not been tested yet.

Have you modified the Spark source code yourself? Have you deployed the Spark
binary distribution on all slave nodes, and set “spark.mesos.executor.home” to
point to it?

It would be cool if you could contribute a patch :)


How to build Spark with my own version of Hadoop?

2015-07-21 Thread Dogtail Ray
Hi,

I have modified some Hadoop code and want to build Spark with the modified
version of Hadoop. Do I need to change the build's dependency files? If so,
how? Thanks very much!
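
(A sketch of the usual approach, with placeholder version strings: install the
modified Hadoop into the local Maven repository under its own version, then
point Spark's build at that version via -Dhadoop.version:)

  # in the modified Hadoop source tree
  mvn install -DskipTests

  # in the Spark source tree
  ./build/mvn -Dhadoop.version=2.7.0-custom -DskipTests clean package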


Re: init / shutdown for complex map job?

2014-12-28 Thread Ray Melton
A follow-up to the blog post cited below was hinted at, per "But Wait,
There's More ... To keep this post brief, the remainder will be left to
a follow-up post."

Is this follow-up pending?  Is it sort of pending?  Did the follow-up
happen, but I just couldn't find it on the web?

Regards, Ray.


On Sun, 28 Dec 2014 08:54:13 +
Sean Owen so...@cloudera.com wrote:

 You can't quite do cleanup in mapPartitions in that way. Here is a
 bit more explanation (farther down):
 http://blog.cloudera.com/blog/2014/09/how-to-translate-from-mapreduce-to-apache-spark/
 On Dec 28, 2014 8:18 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:
 
  Something like?

  val a = myRDD.mapPartitions(p => {
    // Do the init
    // Perform some operations
    // Shut it down?
  })
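
(A sketch of why that placement doesn't quite work, plus one alternative not
from the thread: the function passed to mapPartitions returns its iterator
lazily, so a shutdown written after it runs before the rows are consumed.
From Spark 1.2 on, the cleanup can instead be registered as a task-completion
callback. createConnection/process/myRDD are hypothetical placeholders:)

import org.apache.spark.TaskContext

val a = myRDD.mapPartitions { partition =>
  val conn = createConnection()             // init, once per partition
  TaskContext.get().addTaskCompletionListener { _ =>
    conn.close()                            // shutdown, after the partition is fully consumed
  }
  partition.map(row => process(conn, row))  // the real work, still lazy
}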


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-15 Thread Ray
Hi Xiangrui,

I am using yarn-cluster mode. The current Hadoop cluster is configured to
only accept yarn-cluster mode and not allow yarn-client mode. I have no
privilege to change that.

Without initializing with k-means||, the job finished in 10 minutes. With
k-means||, it just hangs there for almost 1 hour.

I guess I can only go with random initialization in KMeans.

Thanks again for your help.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16530.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys,

I am new to Spark. When I run Spark KMeans
(org.apache.spark.mllib.clustering.KMeans) on a small dataset, it works
great. However, when using a large dataset with 1.5 million vectors, it just
hangs there at some reduceByKey/collectAsMap stages (the attached image shows
the corresponding UI).

http://apache-spark-user-list.1001560.n3.nabble.com/file/n16413/spark.png 



In the log file, I can see the errors below:

14/10/14 13:04:30 ERROR ConnectionManager: Corresponding
SendingConnectionManagerId not found
14/10/14 13:04:30 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@4aeed0e6
java.nio.channels.CancelledKeyException
at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl.java:87)
at
java.nio.channels.SelectionKey.isConnectable(SelectionKey.java:336)
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:352)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/10/14 13:04:30 ERROR SendingConnection: Exception while reading
SendingConnection to ConnectionManagerId(server_name_here,32936)
java.nio.channels.ClosedChannelException
at
sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:252)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:295)
at
org.apache.spark.network.SendingConnection.read(Connection.scala:397)
at
org.apache.spark.network.ConnectionManager$$anon$6.run(ConnectionManager.scala:176)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/10/14 13:04:30 INFO ConnectionManager: Key not valid ?
sun.nio.ch.SelectionKeyImpl@2d584a4e
14/10/14 13:04:30 INFO ConnectionManager: Removing ReceivingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,37767)
14/10/14 13:04:30 INFO ConnectionManager: key already cancelled ?
sun.nio.ch.SelectionKeyImpl@2d584a4e
java.nio.channels.CancelledKeyException
at
org.apache.spark.network.ConnectionManager.run(ConnectionManager.scala:363)
at
org.apache.spark.network.ConnectionManager$$anon$4.run(ConnectionManager.scala:116)
14/10/14 13:04:30 INFO ConnectionManager: Handling connection error on
connection to ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 INFO ConnectionManager: Removing SendingConnection to
ConnectionManagerId(server_name_here,32936)
14/10/14 13:04:30 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@server_name_here:44765] ->
[akka.tcp://spark@server_name_here:46406] disassociated! Shutting down.




Regarding the above errors, I searched online and tried increasing the
following confs, but it still did not work.

spark.worker.timeout=3 

spark.akka.timeout=3 
spark.akka.retry.wait=3 
spark.akka.frameSize=1

spark.storage.blockManagerHeartBeatMs=3  

--driver-memory 2g
--executor-memory 2g
--num-executors 100



I am running spark-submit on YARN. The Spark version is 1.1.0, and Hadoop
is 2.4.1.

Could you please offer some comments/insights?

Thanks a lot.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi guys,

An interesting thing: for the input dataset, which has 1.5 million vectors,
if I set KMeans's k_value = 100 or k_value = 50, it hangs as mentioned
above. However, if I decrease k_value to 10, the same error still appears in
the log, but the application finishes successfully, without observable
hanging.

Hopefully this provides more information.

Thanks.

Ray



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16417.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui,

The input dataset has 1.5 million sparse vectors. Each sparse vector has a
dimension (cardinality) of 9153 and fewer than 15 nonzero elements.


Yes, if I set num-executors = 200, I can see from the Hadoop cluster scheduler
that the application got 201 vCores. From the Spark UI, I can see it got 201
executors (as shown below).

http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_core.png
  

http://apache-spark-user-list.1001560.n3.nabble.com/file/n16428/spark_executor.png
 



Thanks.

Ray




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16428.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Burak,

In KMeans, I used k_value = 100, num_iteration = 2, and num_run = 1.

In the current test, I increased num-executors to 200. In the storage info 2
(as shown below), 11 executors are used (I think the data is roughly
balanced) and the others have zero memory usage.
http://apache-spark-user-list.1001560.n3.nabble.com/file/n16438/spark_storage.png
 


Currently, there is no active stage running, just as in the first image I
posted. You mentioned "It seems that it is hanging, but there is a lot of
calculation going on." I thought that if some calculation were going on,
there would be an active stage with an incomplete progress bar in the UI. Am
I wrong?


Thanks, Burak!

Ray



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16438.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark KMeans hangs at reduceByKey / collectAsMap

2014-10-14 Thread Ray
Hi Xiangrui,

Thanks for the guidance. I read the log carefully and found the root cause.

KMeans, by default, uses k-means|| (which runs k-means++ locally) as the
initialization mode. According to the log file, the 70-minute hang is actually
the computation time of this initialization, as pasted below:

14/10/14 14:48:18 INFO DAGScheduler: Stage 20 (collectAsMap at
KMeans.scala:293) finished in 2.233 s
14/10/14 14:48:18 INFO SparkContext: Job finished: collectAsMap at
KMeans.scala:293, took 85.590020124 s
14/10/14 14:48:18 INFO ShuffleBlockManager: Could not find files for shuffle
5 for deleting
14/10/14 14:48:18 INFO ContextCleaner: Cleaned shuffle 5
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeSystemBLAS
14/10/14 15:50:41 WARN BLAS: Failed to load implementation from:
com.github.fommil.netlib.NativeRefBLAS
14/10/14 15:54:36 INFO LocalKMeans: Local KMeans++ converged in 11
iterations.
14/10/14 15:54:36 INFO KMeans: Initialization with k-means|| took 4426.913
seconds.
14/10/14 15:54:37 INFO SparkContext: Starting job: collectAsMap at
KMeans.scala:190
14/10/14 15:54:37 INFO DAGScheduler: Registering RDD 38 (reduceByKey at
KMeans.scala:190)
14/10/14 15:54:37 INFO DAGScheduler: Got job 16 (collectAsMap at
KMeans.scala:190) with 100 output partitions (allowLocal=false)
14/10/14 15:54:37 INFO DAGScheduler: Final stage: Stage 22(collectAsMap at
KMeans.scala:190)



I now use random as the KMeans initialization mode, and the other confs remain
the same. This time, it just finished quickly~~
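
(A sketch of that change with the MLlib API; "vectors" stands in for the RDD
of 1.5 million sparse vectors described earlier in the thread:)

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(100)
  .setMaxIterations(2)
  .setRuns(1)
  .setInitializationMode(KMeans.RANDOM)   // instead of the default k-means||
  .run(vectors)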

In your test on mnist8m, did you use KMeans++ as the initialization mode? How
long did it take?

Thanks again for your help.

Ray







--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-KMeans-hangs-at-reduceByKey-collectAsMap-tp16413p16450.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark cluster spanning multiple data centers

2014-07-23 Thread Ray Qiu
Hi,

Is it feasible to deploy a Spark cluster spanning multiple data centers if
there are fast connections with not-too-high latency (30 ms) between them? I
don't know whether there are any assumptions in the software that expect all
cluster nodes to be local (super low latency, for instance). Has anyone
tried this?

Thanks,
Ray