Re: Going it alone.

2020-04-14 Thread yeikel valdes
There are many use cases for Spark. A Google search for "use cases for Apache Spark" will give you all the information that you need.

 On Tue, 14 Apr 2020 18:44:59 -0400 janethor...@aol.com.INVALID wrote 



I did write a long email in response to you.
But then I deleted it because I felt it would be too revealing.







On Tuesday, 14 April 2020 David Hesson  wrote:

I want to know if Spark is headed in my direction.

You are implying Spark could be.


What direction are you headed in, exactly? I don't feel as if anything were 
implied when you were asked for use cases or what problem you are solving. You 
were asked to identify some use cases, of which you don't appear to have any.


On Tue, Apr 14, 2020 at 4:49 PM jane thorpe  wrote:


That's what I want to know: use cases.
I am looking for direction, as I described, and I want to know if Spark is
headed in my direction.

You are implying Spark could be.

So tell me about the USE CASES and I'll do the rest.

On Tuesday, 14 April 2020 yeikel valdes  wrote:

It depends on your use case. What are you trying to solve? 



 On Tue, 14 Apr 2020 15:36:50 -0400 janethor...@aol.com.INVALID wrote 



Hi,

I consider myself to be quite good at software development, especially using
frameworks.

I like to get my hands dirty. I have spent the last few months understanding
modern frameworks and architectures.

I am looking to invest my energy in a product where I don't have to rely on
the monkeys which occupy this space we call software development.

I have found one that meets my requirements.

Would Apache Spark be a good tool for me, or do I need to be a member of a team
to develop products using Apache Spark?












What is the best way to take the top N entries from a hive table/data source?

2020-04-13 Thread yeikel valdes
When I use .limit(), the number of partitions of the returned DataFrame is 1,
which normally fails most of my jobs.


val df = spark.sql("select * from table limit n") // n is a placeholder for the number of rows
df.write.parquet("/path/to/output")               // parquet() needs an output path (placeholder shown)




Thanks!
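A possible workaround for the single-partition result (a sketch, not something from
this thread): repartition after the limit so that the write can run in parallel again.
The table name, the value of n, and the output path below are placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("top-n").getOrCreate()

val n = 1000 // placeholder for the number of rows to keep
val df = spark.sql(s"select * from table limit $n")

// as described above, the limit leaves a single partition; spreading the n rows
// back out costs a shuffle but lets the write use more than one task
df.repartition(10).write.parquet("/tmp/top_n_output")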







Re: Serialization or internal functions?

2020-04-07 Thread yeikel valdes
Thanks for your input Soma, but I am actually looking to understand the
differences, not only the performance.


 On Sun, 05 Apr 2020 02:21:07 -0400 somplastic...@gmail.com wrote 


If you want to measure optimisation in terms of time taken, then here is an
idea :)




public class MyClass {
    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();

        // replace with your add column code
        // enough data to measure
        Thread.sleep(5000);

        long end = System.currentTimeMillis();

        int timeTaken = (int) (end - start);

        System.out.println("Time taken " + timeTaken);
    }
}
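If it helps, here is the same idea expressed with the Scala API as a small sketch:
SparkSession exposes a time helper (available in Spark 2.1+, if I remember correctly)
that runs a block and prints the elapsed time in milliseconds.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("timing").getOrCreate()

spark.time {
  // replace this with the add-column code; use enough data to measure
  spark.range(1000000).count()
}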


On Sat, 4 Apr 2020, 19:07 ,  wrote:


Dear Community,

 

Recently, I had to solve the following problem: “for every entry of a
Dataset[String], concat a constant value”. To solve it, I used built-in
functions:

 

val data = Seq("A","b","c").toDS

 

scala> data.withColumn("valueconcat",concat(col(data.columns.head),lit(" 
"),lit("concat"))).select("valueconcat").explain()

== Physical Plan ==

LocalTableScan [valueconcat#161]

 

As an alternative, a much simpler version of the program is to use map, but it
adds a serialization step that does not seem to be present for the version
above:

 

scala> data.map(e=> s"$e concat").explain

== Physical Plan ==

*(1) SerializeFromObject [staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, 
java.lang.String, true], true, false) AS value#92]

+- *(1) MapElements , obj#91: java.lang.String

   +- *(1) DeserializeToObject value#12.toString, obj#90: java.lang.String

  +- LocalTableScan [value#12]

 

Is this over-optimization or is this the right way to go?  

 

As a follow-up, is there any better API to get the one and only column
available in a Dataset[String] when using built-in functions?
“col(data.columns.head)” works, but it is not ideal.

 

Thanks!
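On the follow-up question, a small sketch: a Dataset[String] built with toDS exposes a
single column named "value" (the plan above shows it as value#12), so it can be
referenced directly instead of going through data.columns.head.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession.builder().appName("concat-example").getOrCreate()
import spark.implicits._

val data = Seq("A", "b", "c").toDS()
data.printSchema() // root |-- value: string (nullable = true)

// the same built-in-function version as above, without data.columns.head
data.select(concat(col("value"), lit(" concat")).as("valueconcat")).explain()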

Re: IDE suitable for Spark

2020-04-07 Thread yeikel valdes

Zeppelin is not an IDE but a notebook. It is helpful for experimenting, but it is
missing a lot of the features that we expect from an IDE.


Thanks for sharing though. 


 On Tue, 07 Apr 2020 04:45:33 -0400 zahidr1...@gmail.com wrote 


When I first logged on I asked if there was a suitable IDE for Spark.
I did get a couple of responses.

Thanks.

I did actually find one that is a suitable IDE for Spark: Apache Zeppelin.

One of many reasons it is suitable for Apache Spark is the up-and-running stage,
which involves typing bin/zeppelin-daemon.sh start, then going to a browser and
typing http://localhost:8080

That's it!

Then, to hit the ground running, there are also ready-to-go Apache Spark examples
showing off the type of functionality one will be using in real-life production.



Zeppelin comes with embedded Apache Spark and Scala as the default interpreter,
along with 20+ other interpreters.
I have gone on to discover there are a number of other advantages for a real-time
production environment with Zeppelin, offered up by other Apache products.



Backbutton.co.uk
¯\_(ツ)_/¯
♡۶Java♡۶RMI ♡۶
Make Use Method {MUM}
makeuse.org


What options do I have to handle third party classes that are not serializable?

2020-02-25 Thread yeikel valdes
I am currently using a third-party library (Lucene) with Spark that is not
serializable. Because of that, it generates the following exception:


Job aborted due to stage failure: Task 144.0 in stage 25.0 (TID 2122) had a not 
serializable result: org.apache.lucene.facet.FacetsConfig Serialization stack: 
- object not serializable (class: org.apache.lucene.facet.FacetsConfig, value: 
org.apache.lucene.facet.FacetsConfg
While it would be ideal if this class were serializable, there is really nothing
I can do to change this third-party library in order to add serialization to it.
What options do I have, and what's the recommended option to handle this
problem?
Thank you!
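A common workaround (a sketch, under the assumption that the Lucene dependency is on
the executors' classpath): instead of shipping the non-serializable object from the
driver, construct it on the executors, for example once per partition. The Dataset here
is just a stand-in for the real data.

import org.apache.spark.sql.SparkSession
import org.apache.lucene.facet.FacetsConfig

val spark = SparkSession.builder().appName("lucene-workaround").getOrCreate()
import spark.implicits._

val docs = Seq("doc one", "doc two").toDS()

val processed = docs.mapPartitions { iter =>
  val config = new FacetsConfig() // created on the executor, never serialized
  iter.map { doc =>
    // ... use `config` to do the per-document facet work here ...
    doc.length
  }
}
processed.show()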

Re: [Spark SQL] Memory problems with packing too many joins into the same WholeStageCodegen

2020-02-25 Thread yeikel valdes
Can you please explain what you mean by that? How do you use a UDF to replace
a join? Thanks




 On Mon, 24 Feb 2020 22:06:40 -0500 jianneng...@workday.com wrote 


Thanks Genie. Unfortunately, the joins I'm doing in this case are large, so a UDF
likely won't work.


Jianneng
From: Liu Genie 
Sent: Monday, February 24, 2020 6:39 PM
To: user@spark.apache.org 
Subject: Re: [Spark SQL] Memory problems with packing too many joins into the 
same WholeStageCodegen
 
I have encountered the too-many-joins problem before. Since the joined dataframe was
small enough, I converted the join to a UDF operation, which was much faster and didn't
generate an out-of-memory problem.



On Feb 25, 2020 at 10:15, Jianneng Li  wrote:


Hello everyone,


WholeStageCodegen generates code that appends results into a 
BufferedRowIterator, which keeps the results in an in-memory linked list. Long 
story short, this is a problem when multiple joins (i.e. BroadcastHashJoin) 
that can blow up get planned into the same WholeStageCodegen - results keep on 
accumulating in the linked list, and do not get consumed fast enough, 
eventually causing the JVM to run out of memory.


Does anyone else have experience with this problem? Some obvious solutions 
include making BufferedRowIterator spill the linked list, or make it bounded, 
but I'd imagine that this would have been done a long time ago if it were 
necessary.


Thanks,


Jianneng
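A rough sketch of the approach Genie describes (only viable when one side of the join
is small): collect the small side into a map, broadcast it, and look values up inside a
UDF instead of using a join operator. The table contents below are made up for
illustration.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-instead-of-join").getOrCreate()
import spark.implicits._

val small = Seq((1, "one"), (2, "two")).toDF("id", "label")
val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "payload")

// the small side fits on the driver, so collect it and broadcast the lookup map
val lookup = spark.sparkContext.broadcast(small.as[(Int, String)].collect().toMap)

val lookupUdf = udf((id: Int) => lookup.value.get(id))

// roughly equivalent in result to large.join(small, Seq("id"), "left"),
// but without a join operator in the plan
large.withColumn("label", lookupUdf($"id")).show()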



Re: union two pyspark dataframes from different SparkSessions

2020-01-29 Thread yeikel valdes
From what I understand, the session is a singleton, so even if you think you
are creating new instances you are just reusing it.
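A minimal sketch of that behaviour using the Scala API (the PySpark builder behaves
analogously): getOrCreate() hands back the existing session rather than a new one, and
there is a single SparkContext per JVM behind it.

import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder().appName("a").getOrCreate()
val s2 = SparkSession.builder().appName("b").getOrCreate()

println(s1 eq s2)                           // true: the same session is reused
println(s1.sparkContext eq s2.sparkContext) // true: one SparkContext per JVM

// SparkSession.newSession() would create a distinct session,
// but it still shares the same underlying SparkContext.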




 On Wed, 29 Jan 2020 02:24:05 -1100 icbm0...@gmail.com wrote 


Dear all

I already had a python function which is used to query data from HBase and
HDFS with given parameters. This function returns a pyspark dataframe and
the SparkContext it used.

With the client's increasing demands, I need to merge data from multiple queries.
I tested using the "union" function to merge the pyspark dataframes returned by
different function calls directly, and it worked. It surprised me that a
pyspark dataframe can actually union dataframes from different SparkSessions.

I am using pyspark 2.3.1 and Python 3.5.

I wonder if this is good practice, or whether I'd better use the same SparkSession
for all the queries?

Best regards



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [External]Re: spark 2.x design docs

2019-09-19 Thread yeikel valdes
I am also interested. Many of the docs/books that I've seen are practical
examples about usage rather than the deep internals of Spark.




 On Wed, 18 Sep 2019 21:12:12 -1100 vipul.s.p...@gmail.com wrote 


Yes,

I realize what you were looking for; I am also looking for the same docs and
haven't found them yet. Also, Jacek Laskowski's gitbooks are the next best thing
to follow, if you haven't already.


Regards


On Thu, Sep 19, 2019 at 12:46 PM  wrote:


Thanks Vipul,

 

I was looking specifically for the documents Spark committers use for reference.

 

Currently I’ve put custom logs in the spark-core sources, and then I build it and
run jobs on it.

From the printed logs I try to understand the execution flows.

 

From: Vipul Rajan 
Sent: Thursday, September 19, 2019 12:23 PM
To: Kamal7 Kumar 
Cc: spark-user 
Subject: [External]Re: spark 2.x design docs

 


https://github.com/JerryLead/SparkInternals/blob/master/EnglishVersion/2-JobLogicalPlan.md
This is pretty old, but it might help a little bit. I myself am going through
the source code and trying to reverse-engineer stuff. Let me know if you'd like
to pool resources sometime.

 

Regards

 

On Thu, Sep 19, 2019 at 11:35 AM  wrote:

Hi ,

Can someone provide documents/links (apart from the official documentation) for
understanding the internal workings of spark-core,

i.e. documents containing component pseudocode, class diagrams, execution flows,
etc.?

Thanks, Kamal


"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s), are confidential and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, re-transmission, conversion to hard copy, copying, circulation or 
other use of this message and any attachments is strictly prohibited. If you 
are not the intended recipient, please notify the sender immediately by return 
email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. The company cannot accept responsibility 
for any loss or damage arising from the use of this email or attachment."


"Confidentiality Warning: This message and any attachments are intended only 
for the use of the intended recipient(s), are confidential and may be 
privileged. If you are not the intended recipient, you are hereby notified that 
any review, re-transmission, conversion to hard copy, copying, circulation or 
other use of this message and any attachments is strictly prohibited. If you 
are not the intended recipient, please notify the sender immediately by return 
email and delete this message and any attachments from your system.

Virus Warning: Although the company has taken reasonable precautions to ensure 
no viruses are present in this email. The company cannot accept responsibility 
for any loss or damage arising from the use of this email or attachment."

Re:Does Spark SQL has match_recognize?

2019-05-26 Thread yeikel valdes
Isn't match_recognize just a filter?

df.filter(predicate)?

 On Sat, 25 May 2019 12:55:47 -0700 kanth...@gmail.com wrote 

Hi All,

Does Spark SQL have match_recognize? I am not sure why CEP seems to be neglected;
I believe it is one of the most useful concepts in financial applications!

Is there a plan to support it?

Thanks!

Re:Load Time from HDFS

2019-04-10 Thread yeikel valdes
What about a simple call to nanoTime?

val startTime = System.nanoTime()

// Spark work here

val endTime = System.nanoTime()

val duration = endTime - startTime // elapsed time in nanoseconds

println(duration)

Count recomputes the df, so it makes sense that it takes longer for you.

 On Tue, 02 Apr 2019 07:06:30 -0700 koloka...@ics.forth.gr wrote 

Hello, 

I want to ask if there is any way to measure HDFS data loading time at
the start of my program. I tried to add an action, e.g. count(), after the
val data = sc.textFile() call, but I noticed that my program takes more time
to finish than before adding the count call. Is there any other way to do it?

Thanks, 
--Iacovos 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org 



Re: Question about relationship between number of files and initial tasks(partitions)

2019-04-10 Thread yeikel valdes
If you need to reduce the number of partitions you could also try df.coalesce
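A minimal sketch of the coalesce suggestion. The input path is a placeholder, reading
Avro assumes the spark-avro package is available, and a SparkSession named spark is
assumed to be in scope.

val df = spark.read.format("avro").load("/path/to/avro/files")
println(df.rdd.getNumPartitions)      // often one partition per input file

// coalesce shrinks the partition count for later stages without a full shuffle
val reduced = df.coalesce(8)
println(reduced.rdd.getNumPartitions) // 8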

 On Thu, 04 Apr 2019 06:52:26 -0700 jasonnerot...@gmail.com wrote 

Have you tried something like this?

spark.conf.set("spark.sql.shuffle.partitions", "5" ) 



On Wed, Apr 3, 2019 at 8:37 PM Arthur Li  wrote:
Hi Sparkers,

I noticed that in my Spark application, the number of tasks in the first stage
is equal to the number of files read by the application (at least for Avro) if
the number of CPU cores is less than the number of files. If there are more CPU
cores than files, it's usually equal to the default parallelism number. Why does
it behave like this? Would this require a lot of resources from the driver? Is
there anything we can do to decrease the number of tasks (partitions) in the
first stage without merging files before loading?

Thanks,
Arthur 




-- 
Thanks,
Jason

Re:Parquet file number of columns

2019-01-07 Thread yeikel valdes
Not according to the Parquet dev group:

https://groups.google.com/forum/m/#!topic/parquet-dev/jj7TWPIUlYI

 On Mon, 07 Jan 2019 05:11:51 -0800 gourav.sengu...@gmail.com wrote 

Hi,

Is there any limit to the number of columns that we can have in Parquet file 
format? 


Thanks and Regards,
Gourav Sengupta

Re: Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Ideally, we would like to copy, paste, and try it on our end. A screenshot is not
enough.

If you have private information, just remove it and create a minimal example we can
use to replicate the issue.
I'd say something similar to this:

https://stackoverflow.com/help/mcve
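A sketch of the kind of minimal, self-contained example being asked for here, based on
the description quoted below (ArrayBuffer to RDD to HDFS); the output path is a
placeholder.

import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer

val spark = SparkSession.builder().appName("rdd-write-repro").getOrCreate()
val sc = spark.sparkContext

val buffer = ArrayBuffer("1 2", "2 3", "4 5")
val rdd = sc.parallelize(buffer)
rdd.saveAsTextFile("hdfs:///tmp/rdd-output")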

 On Mon, 07 Jan 2019 04:15:16 -0800 fyyleej...@163.com wrote 

Sorry, the code is too long, so to put it simply, look at the photo.

I define an ArrayBuffer containing "1 2", "2 3", and "4 5", and I want to
save it in HDFS, so I turn it into an RDD with
sc.parallelize(arrayBuffer),
but while in IDEA println(_) shows the right values, in distributed mode
there is nothing.



-- 
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org 



RE: Re: Spark Kinesis Connector SSL issue

2019-01-07 Thread yeikel valdes
Any chance you can share a minimum example to replicate the issue?

 On Mon, 07 Jan 2019 04:17:44 -0800 shashikantbang...@discover.com wrote 


Hi Valdes,

 

Thank you for your response; to answer your question, yes I can.

 

@ben : correct me if I am wrong.

 

Cheers,

Shashi

 

Shashikant Bangera | DevOps Engineer

Payment Services DevOps Engineering

Email: shashikantbang...@discover.com

Group email: eppdev...@discover.com

Tel: +44 (0)

Mob: +44 (0) 7440783885

 

 

From: yeikel valdes [mailto:em...@yeikel.com] 
Sent: 07 January 2019 12:15
To: Shashikant Bangera 
Cc: user@spark.apache.org
Subject: [EXTERNAL] Re: Spark Kinesis Connector SSL issue

 


 

Can you call this service with regular code(No Spark)?

 


 On Mon, 07 Jan 2019 02:42:48 -0800 shashikantbang...@discover.com wrote 


Hi team, 

please help, we are kind of blocked here.

Cheers, 
Shashi 



-- 
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

 


Fwd:Re: Can an UDF return a custom class other than case class?

2019-01-07 Thread yeikel valdes


 Forwarded Message 
>From : em...@yeikel.com
To : kfehl...@gmail.com
Date : Mon, 07 Jan 2019 04:11:22 -0800
Subject : Re: Can an UDF return a custom class other than case class?


In this case I am just curious because I'd like to know if it is possible. 

At the same time I will be interacting with external Java class files if that's 
allowed.

Also, what are the equivalents for other languages like Java? I am not aware of
anything similar to a case class in Java.

I am currently using Scala, but I might use PySpark or the Java APIs in the
future.

Thank you

 On Sun, 06 Jan 2019 22:06:28 -0800 kfehl...@gmail.com wrote 

Is there a reason why case classes won't work for your use case?

On Sun, Jan 6, 2019 at 10:43 PM  wrote:
Hi ,

 

Is it possible to return a custom class from a UDF other than a case class?

If so, how can we avoid this exception?
java.lang.UnsupportedOperationException: Schema for type {custom type} is not
supported

 

Full Example :

 

import spark.implicits._
import org.apache.spark.sql.functions.udf

class Person(val name: String)

val toPerson = (s1: String) => new Person(s1)

val dataset = Seq("John Smith").toDF("name")

val personUDF = udf(toPerson)

java.lang.UnsupportedOperationException: Schema for type Person is not supported

  at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780)

  at 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715)

  at 
scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)

  at 
org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)

  at 
org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)

  at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714)

  at 
org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:711)

  at org.apache.spark.sql.functions$.udf(functions.scala:3340)

 

dataset.withColumn("person", personUDF($"name"))

 

 

Thank you.
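A minimal sketch of the usual way around this: declare Person as a case class so that
Spark can derive a schema for it, and the UDF then returns a struct column.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("case-class-udf").getOrCreate()
import spark.implicits._

case class Person(name: String)

val personUDF = udf((s: String) => Person(s))

val dataset = Seq("John Smith").toDF("name")
dataset.withColumn("person", personUDF($"name")).show(false)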




Re:Writing RDDs to HDFS is empty

2019-01-07 Thread yeikel valdes
Please share a minimal amount of code so we can try to reproduce the issue...

 On Mon, 07 Jan 2019 00:46:42 -0800 fyyleej...@163.com wrote 

Hi all,
In my experiment program I used Spark GraphX.
When running in IDEA on Windows, the result is right,
but when running on the Linux distributed cluster, the result in HDFS is
empty.
Why? How can I solve this?

 

Thanks! 
Jian Li 



-- 
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ 

- 
To unsubscribe e-mail: user-unsubscr...@spark.apache.org