Use GraphFrames?

On Sat, Jul 16, 2022 at 3:54 PM Yuhao Zhang <yhzhang1...@gmail.com> wrote:
> Hi Shay,
>
> Thanks for your reply! I would very much like to use PySpark. However, my
> project depends on GraphX, which as far as I know is only available in the
> Scala API, so I'm locked into Scala and trying to find a way out.
>
> Best regards,
> Yuhao Zhang
>
> On Sun, Jul 10, 2022 at 5:36 AM Shay Elbaz <shay.el...@gm.com> wrote:
>
>> Yuhao,
>>
>> You can use PySpark as the entry point to your application. With Py4J you
>> can call Java/Scala functions from the Python application. There's no need
>> to use the pipe() function for that.
>>
>> Shay
>> ------------------------------
>> *From:* Yuhao Zhang <yhzhang1...@gmail.com>
>> *Sent:* Saturday, July 9, 2022 4:13:42 AM
>> *To:* user@spark.apache.org
>> *Subject:* [EXTERNAL] RDD.pipe() for binary data
>>
>> Hi All,
>>
>> I'm currently working on a project that transfers data between Spark 3.x
>> (I use Scala) and a Python runtime. In Spark, the data is stored in an RDD
>> as arrays/vectors of floating-point numbers, and I have custom routines
>> written in Python to process them. On the Spark side, I also have some
>> operations specific to the Spark Scala API, so I need both runtimes.
>>
>> To transfer the data I've been using the RDD.pipe() API:
>> 1. Spark converts the arrays to strings and calls RDD.pipe(script.py).
>> 2. Python receives the strings, parses them back into Python data
>> structures, and runs its operations.
>> 3. Python converts the arrays back into strings and prints them to Spark.
>> 4. Spark parses the strings back into arrays.
>>
>> Needless to say, this feels unnatural and slow to me, and there are
>> potential floating-point precision issues: I think the float arrays should
>> be transmitted as raw bytes instead.
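The precision worry above is easy to reproduce outside Spark: round-tripping a float64 through fixed-precision text loses information, while round-tripping the same value through its raw 8-byte encoding is exact. A minimal, Spark-free sketch:

```python
import struct

x = 0.1 + 0.2  # a float64 with no short exact decimal form

# Text round-trip through a fixed-precision format drops bits:
text = "%.6f" % x              # "0.300000"
assert float(text) != x        # information was lost

# Raw-bytes round-trip is exact: 8 bytes per float64, no loss.
packed = struct.pack("<d", x)  # little-endian IEEE 754 double
assert struct.unpack("<d", packed)[0] == x
```

Printing with 17 significant digits (e.g. "%.17g") also round-trips a double exactly, but at that point the text is both larger and slower to parse than the raw bytes.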
>> I found no way to use RDD.pipe() for this purpose: as written in
>> https://github.com/apache/spark/blob/3331d4ccb7df9aeb1972ed86472269a9dbd261ff/core/src/main/scala/org/apache/spark/rdd/PipedRDD.scala#L139,
>> pipe() appears to be locked into text-based, line-oriented streaming.
>>
>> Can anyone shed some light on how I can achieve this? I'm trying to come
>> up with a way that does not involve modifying core Spark myself. One
>> potential solution I can think of is saving/loading the RDD as binary
>> files, but I'm hoping for a streaming-based solution. Any help is much
>> appreciated, thanks!
>>
>> Best regards,
>> Yuhao
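The raw-bytes streaming described above can at least be sketched outside Spark with a plain subprocess: pack the float64s, pipe the bytes through a child process's binary stdin/stdout, and unpack the reply. Everything here (the child script, the pipe_binary helper, the little-endian framing) is a hypothetical illustration of the byte protocol, not an API that RDD.pipe() offers; inside Spark the closest analogue would be spawning such a subprocess from mapPartitions, or skipping the pipe entirely by driving the application from PySpark via Py4J as suggested above.

```python
import struct
import subprocess
import sys

# Hypothetical child script: reads little-endian float64s from binary
# stdin, doubles each value, writes float64s back to binary stdout.
CHILD = (
    "import sys, struct\n"
    "data = sys.stdin.buffer.read()\n"
    "vals = struct.unpack('<%dd' % (len(data) // 8), data)\n"
    "out = struct.pack('<%dd' % len(vals), *(v * 2 for v in vals))\n"
    "sys.stdout.buffer.write(out)\n"
)

def pipe_binary(values):
    """Send float64s to a child process as raw bytes, read raw bytes back."""
    payload = struct.pack("<%dd" % len(values), *values)
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        input=payload, stdout=subprocess.PIPE, check=True,
    )
    out = proc.stdout
    return list(struct.unpack("<%dd" % (len(out) // 8), out))

result = pipe_binary([0.1, 0.2, 0.3])  # exact: no string conversion anywhere
```

Because doubling a float64 is exact, the values survive the round trip bit-for-bit; the same data pushed through a "%.6f"-style text pipe would not.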