Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone,

We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 
3.5.1.

I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 
21?

Thanks,

Steve C




Re: [spark-graphframes]: Generating incorrect edges

2024-04-30 Thread Stephen Coy
Hi Mich,

I was just reading random questions on the user list when I noticed that you 
said:

On 25 Apr 2024, at 2:12 AM, Mich Talebzadeh  wrote:

1) You are using monotonically_increasing_id(), which is not 
collision-resistant in distributed environments like Spark. Multiple hosts
   can generate the same ID. I suggest switching to UUIDs (e.g., uuid.uuid4()) 
for guaranteed uniqueness.


It’s my understanding that the *Spark* `monotonically_increasing_id()` function 
exists for the exact purpose of generating a collision-resistant unique id 
across nodes on different hosts.
We use it extensively for this purpose and have never encountered an issue.

Are we wrong or are you thinking of a different (not Spark) function?
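For anyone reading this in the archive, here is a minimal Java sketch (mine, not from the original thread; the dataset and column names are purely illustrative) of the kind of usage we rely on:

import static org.apache.spark.sql.functions.monotonically_increasing_id;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Assign a unique 64-bit id to every row of a Dataset. The generated ids are
// unique across all partitions (the upper bits encode the partition id, the
// lower bits a per-partition counter), but they are neither consecutive nor
// stable across re-computations of the Dataset.
public class MonotonicIdExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("MonotonicIdExample")
            .master("local[*]")   // local master only for this demo
            .getOrCreate();

        Dataset<Row> vertices = spark.range(10).toDF("value")
            .withColumn("vertex_id", monotonically_increasing_id());

        vertices.show();
        spark.stop();
    }
}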

Cheers,

Steve C






Re: Spark-SQL - Query Hanging, How To Troubleshoot

2023-08-10 Thread Stephen Coy
Hi Patrick,

When this has happened to me in the past (admittedly via spark-submit) it has 
been because another job was still running and had already claimed some of the 
resources (cores and memory).

I think this can also happen if your configuration tries to claim resources 
that will never be available.
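(Not from the original exchange, but as an illustration of that second point: a minimal Java sketch of capping what a single application claims on a standalone cluster so that concurrent jobs can still obtain executors. The numbers are placeholders, not recommendations, and the master URL is assumed to come from spark-submit.)

import org.apache.spark.sql.SparkSession;

// Cap this application's resource claim on a standalone cluster. If the
// per-executor settings exceed what any worker can actually offer, the app
// will sit waiting for executors that can never be scheduled.
public class CappedResourcesExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("CappedResourcesExample")
            .config("spark.cores.max", "16")        // total cores this app may take (standalone)
            .config("spark.executor.cores", "4")
            .config("spark.executor.memory", "8g")
            .getOrCreate();

        // ... job logic ...
        spark.stop();
    }
}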

Cheers,

SteveC


On 11 Aug 2023, at 3:36 am, Patrick Tucci  wrote:

Hello,

I'm attempting to run a query on Spark 3.4.0 through the Spark ThriftServer. 
The cluster has 64 cores, 250GB RAM, and operates in standalone mode using HDFS 
for storage.

The query is as follows:

SELECT ME.*, MB.BenefitID
FROM MemberEnrollment ME
JOIN MemberBenefits MB
ON ME.ID = MB.EnrollmentID
WHERE MB.BenefitID = 5
LIMIT 10

The tables are defined as follows:

-- Contains about 3M rows
CREATE TABLE MemberEnrollment
(
ID INT
, MemberID VARCHAR(50)
, StartDate DATE
, EndDate DATE
-- Other columns, but these are the most important
) STORED AS ORC;

-- Contains about 25m rows
CREATE TABLE MemberBenefits
(
EnrollmentID INT
, BenefitID INT
) STORED AS ORC;

When I execute the query, it runs a single broadcast exchange stage, which 
completes after a few seconds. Then everything just hangs. The JDBC/ODBC tab in 
the UI shows the query state as COMPILED, but no stages or tasks are executing 
or pending:



I've let the query run for as long as 30 minutes with no additional stages, 
progress, or errors. I'm not sure where to start troubleshooting.

Thanks for your help,

Patrick



Re: Dataproc serverless for Spark

2022-11-21 Thread Stephen Boesch
Out of curiosity: are there functional limitations in Spark Standalone
that are of concern? YARN is more configurable for running non-Spark
workloads and for running multiple Spark jobs in parallel, but for a single
Spark job it seems standalone launches more quickly and does not miss any
features. Are there specific limitations you are aware of or have run into?

stephen b

On Mon, 21 Nov 2022 at 09:01, Mich Talebzadeh 
wrote:

> Hi,
>
> I have not tested this myself but Google have brought up *Dataproc
> Serverless for Spark*. In a nutshell, Dataproc Serverless lets you run Spark
> batch workloads without requiring you to provision and manage your own
> cluster. Specify workload parameters, and then submit the workload to the
> Dataproc Serverless service. The service will run the workload on a managed
> compute infrastructure, autoscaling resources as needed. Dataproc Serverless
> charges apply only to the time when the workload is executing. Google
> Dataproc is similar to Amazon EMR.
>
> So in short you don't need to provision your own Dataproc cluster etc. One
> thing I noticed from the release doc
> <https://cloud.google.com/dataproc-serverless/docs/overview> is that the
> resource management is *Spark based* as opposed to standard Dataproc,
> which is YARN based. It is available for Spark 3.2. My assumption is
> that by Spark based it means that Spark is running in standalone mode. Has
> there been much improvement in release 3.2 for standalone mode?
>
> Thanks
>
>
>
>
>view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Spark Scala Contract Opportunity @USA

2022-11-10 Thread Stephen Boesch
Please do not send advertisements on this channel.

On Thu, 10 Nov 2022 at 13:40, sri hari kali charan Tummala <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> Is anyone looking for a spark scala contract role inside the USA? A
> company called Maxonic has an open spark scala contract position (100%
> remote) inside the USA if anyone is interested, please send your CV to
> kali.tumm...@gmail.com.
>
> Thanks & Regards
> Sri Tummala
>
>


Re: [Building] Building with JDK11

2022-07-18 Thread Stephen Coy
Hi Sergey,

I’m willing to be corrected, but I think the use of a JAVA_HOME environment 
variable was something that was started by and continues to be perpetuated by 
Apache Tomcat.

… or maybe Apache Ant, but modern versions of Ant do not need it either.

It is not needed for modern releases of Apache Maven.

Cheers,

Steve C

On 18 Jul 2022, at 4:12 pm, Sergey B. <sergey.bushma...@gmail.com> wrote:

Hi Steve,

Can you shed some light why do they need $JAVA_HOME at all if everything is 
already in place?

Regards,
- Sergey

On Mon, Jul 18, 2022 at 4:31 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi Szymon,

There seems to be a common misconception that setting JAVA_HOME will set the 
version of Java that is used.

This is not true, because in most environments you need to have a PATH 
environment variable set up that points at the version of Java that you want to 
use.

You can set JAVA_HOME to anything at all and `java -version` will always return 
the same result.

The way that you configure PATH varies from OS to OS:


  *   On macOS, use `/usr/libexec/java_home -v11`
  *   On Linux, use `sudo alternatives --config java`
  *   On Windows, I have no idea

When you do this, the `mvn` command will compute the value of JAVA_HOME for its 
own use; there is no need to explicitly set it yourself (unless other Java 
applications that you use insist upon it).
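If it helps, here is a tiny, purely illustrative Java program (not from the original mail) that demonstrates the point: the JVM that actually runs is the one found on PATH, whatever JAVA_HOME happens to say.

// Compile and run this with different JAVA_HOME values; the java.version and
// java.home lines only change when the `java` on your PATH changes.
public class WhichJava {
    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("java.home    = " + System.getProperty("java.home"));
        System.out.println("JAVA_HOME    = " + System.getenv("JAVA_HOME"));
    }
}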


Cheers,

Steve C

On 16 Jul 2022, at 7:24 am, Szymon Kuryło <szymonkur...@gmail.com> wrote:

Hello,

I'm trying to build a Java 11 Spark distro using the dev/make-distribution.sh 
script.
I have set JAVA_HOME to point to JDK11 location, I've also set the java.version 
property in pom.xml to 11, effectively also setting `maven.compile.source` and 
`maven.compile.target`.
When inspecting classes from the `dist` directory with `javap -v`, I find that 
the class major version is 52, which is specific to JDK8. Am I missing 
something? Is there a reliable way to set the JDK used in the build process?

Thanks,
Szymon K.




Re: [Building] Building with JDK11

2022-07-17 Thread Stephen Coy
Hi Szymon,

There seems to be a common misconception that setting JAVA_HOME will set the 
version of Java that is used.

This is not true, because in most environments you need to have a PATH 
environment variable set up that points at the version of Java that you want to 
use.

You can set JAVA_HOME to anything at all and `java -version` will always return 
the same result.

The way that you configure PATH varies from OS to OS:


  *   On macOS, use `/usr/libexec/java_home -v11`
  *   On Linux, use `sudo alternatives --config java`
  *   On Windows, I have no idea

When you do this, the `mvn` command will compute the value of JAVA_HOME for its 
own use; there is no need to explicitly set it yourself (unless other Java 
applications that you use insist upon it).


Cheers,

Steve C

On 16 Jul 2022, at 7:24 am, Szymon Kuryło <szymonkur...@gmail.com> wrote:

Hello,

I'm trying to build a Java 11 Spark distro using the dev/make-distribution.sh 
script.
I have set JAVA_HOME to point to JDK11 location, I've also set the java.version 
property in pom.xml to 11, effectively also setting `maven.compile.source` and 
`maven.compile.target`.
When inspecting classes from the `dist` directory with `javap -v`, I find that 
the class major version is 52, which is specific to JDK8. Am I missing 
something? Is there a reliable way to set the JDK used in the build process?

Thanks,
Szymon K.



Re: Retrieve the count of spark nodes

2022-06-08 Thread Stephen Coy
Hi there,

We use something like:


/*
 * Force Spark to initialise the defaultParallelism by executing a dummy 
parallel operation and then return
 * the resulting defaultParallelism.
 */
private int getWorkerCount(SparkContext sparkContext) {
sparkContext.parallelize(List.of(1, 2, 3, 4)).collect();
return sparkContext.defaultParallelism();
}


It's useful for setting certain pool sizes dynamically, such as:


sparkContext.hadoopConfiguration().set("fs.s3a.connection.maximum", 
Integer.toString(workerCount * 2));

This works in our Spark 3.0.1 code; we are just migrating to 3.2.1 now.

Cheers,

Steve C

On 8 Jun 2022, at 4:28 pm, Poorna Murali <poornamur...@gmail.com> wrote:

Hi,

I would like to know if it is possible to  get the count of live master and 
worker spark nodes running in a system.

Please help to clarify the same.

Thanks,
Poorna



Re: Using Avro file format with SparkSQL

2022-02-14 Thread Stephen Coy
Hi Morven,

We use —packages for all of our spark jobs. Spark downloads the specified jar 
and all of its dependencies from a Maven repository.

This means we never have to build fat or uber jars.

It does mean that the Apache Ivy configuration has to be set up correctly 
though.

Cheers,

Steve C

> On 15 Feb 2022, at 5:58 pm, Morven Huang  wrote:
>
> I wrote a toy spark job and ran it within my IDE, same error if I don’t add 
> spark-avro to my pom.xml. After putting spark-avro dependency to my pom.xml, 
> everything works fine.
>
> Another thing is, if my memory serves me right, the spark-submit option for 
> extra jars is ‘--jars’, not ‘--packages’.
>
> Regards,
>
> Morven Huang
>
>
> On 2022/02/10 03:25:28 "Karanika, Anna" wrote:
>> Hello,
>>
>> I have been trying to use spark SQL’s operations that are related to the 
>> Avro file format,
>> e.g., stored as, save, load, in a Java class but they keep failing with the 
>> following stack trace:
>>
>> Exception in thread "main" org.apache.spark.sql.AnalysisException:  Failed 
>> to find data source: avro. Avro is built-in but external data source module 
>> since Spark 2.4. Please deploy the application as per the deployment section 
>> of "Apache Avro Data Source Guide".
>>at 
>> org.apache.spark.sql.errors.QueryCompilationErrors$.failedToFindAvroDataSourceError(QueryCompilationErrors.scala:1032)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
>>at 
>> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:720)
>>at 
>> org.apache.spark.sql.DataFrameWriter.lookupV2Provider(DataFrameWriter.scala:852)
>>at 
>> org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:256)
>>at 
>> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
>>at xsys.fileformats.SparkSQLvsAvro.main(SparkSQLvsAvro.java:57)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
>> Method)
>>at 
>> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:64)
>>at 
>> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>at java.base/java.lang.reflect.Method.invoke(Method.java:564)
>>at 
>> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>>at 
>> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
>>at 
>> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>>at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>>at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>>at 
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
>>at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
>>at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> For context, I am invoking spark-submit and adding arguments --packages 
>> org.apache.spark:spark-avro_2.12:3.2.0.
>> Yet, Spark responds as if the dependency was not added.
>> I am running spark-v3.2.0 (Scala 2.12).
>>
>> On the other hand, everything works great with spark-shell or spark-sql.
>>
>> I would appreciate any advice or feedback to get this running.
>>
>> Thank you,
>> Anna
>>
>>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>



Re: Migration to Spark 3.2

2022-01-27 Thread Stephen Coy
Hi Aurélien,

Your Jackson versions look fine.

What happens if you change the scope of your Jackson dependencies to “provided”?

This should result in your application using the versions provided by Spark and 
avoid this potential collision.

Cheers,

Steve C

On 27 Jan 2022, at 9:48 pm, Aurélien Mazoyer <aurel...@aepsilon.com> wrote:

Hi Stephen,

Thank you for your answer!
Here it is, it seems that jackson dependencies are correct, no? :

Thanks,

[INFO] com.krrier:spark-lib-full:jar:0.0.1-SNAPSHOT
[INFO] +- com.krrier:backend:jar:0.0.1-SNAPSHOT:compile
[INFO] |  \- com.krrier:data:jar:0.0.1-SNAPSHOT:compile
[INFO] +- com.krrier:plugin-api:jar:0.0.1-SNAPSHOT:compile
[INFO] +- com.opencsv:opencsv:jar:4.2:compile
[INFO] |  +- org.apache.commons:commons-lang3:jar:3.9:compile
[INFO] |  +- org.apache.commons:commons-text:jar:1.3:compile
[INFO] |  +- commons-beanutils:commons-beanutils:jar:1.9.3:compile
[INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  \- org.apache.commons:commons-collections4:jar:4.1:compile
[INFO] +- org.apache.solr:solr-solrj:jar:7.4.0:compile
[INFO] |  +- org.apache.commons:commons-math3:jar:3.6.1:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.5.3:compile
[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.4.6:compile
[INFO] |  +- org.apache.httpcomponents:httpmime:jar:4.5.3:compile
[INFO] |  +- org.apache.zookeeper:zookeeper:jar:3.4.11:compile
[INFO] |  +- org.codehaus.woodstox:stax2-api:jar:3.1.4:compile
[INFO] |  +- org.codehaus.woodstox:woodstox-core-asl:jar:4.4.1:compile
[INFO] |  \- org.noggit:noggit:jar:0.8:compile
[INFO] +- com.databricks:spark-xml_2.12:jar:0.5.0:compile
[INFO] +- org.apache.tika:tika-parsers:jar:1.24:compile
[INFO] |  +- org.apache.tika:tika-core:jar:1.24:compile
[INFO] |  +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:compile
[INFO] |  |  +- jakarta.xml.bind:jakarta.xml.bind-api:jar:2.3.2:compile
[INFO] |  |  +- org.glassfish.jaxb:txw2:jar:2.3.2:compile
[INFO] |  |  +- com.sun.istack:istack-commons-runtime:jar:3.0.8:compile
[INFO] |  |  +- org.jvnet.staxex:stax-ex:jar:1.8.1:compile
[INFO] |  |  \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:compile
[INFO] |  +- com.sun.activation:jakarta.activation:jar:1.2.1:compile
[INFO] |  +- xerces:xercesImpl:jar:2.12.0:compile
[INFO] |  |  \- xml-apis:xml-apis:jar:1.4.01:compile
[INFO] |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
[INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
[INFO] |  +- org.tallison:jmatio:jar:1.5:compile
[INFO] |  +- org.apache.james:apache-mime4j-core:jar:0.8.3:compile
[INFO] |  +- org.apache.james:apache-mime4j-dom:jar:0.8.3:compile
[INFO] |  +- org.tukaani:xz:jar:1.8:compile
[INFO] |  +- com.epam:parso:jar:2.0.11:compile
[INFO] |  +- org.brotli:dec:jar:0.1.2:compile
[INFO] |  +- commons-codec:commons-codec:jar:1.13:compile
[INFO] |  +- org.apache.pdfbox:pdfbox:jar:2.0.19:compile
[INFO] |  |  \- org.apache.pdfbox:fontbox:jar:2.0.19:compile
[INFO] |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.19:compile
[INFO] |  +- org.apache.pdfbox:preflight:jar:2.0.19:compile
[INFO] |  |  \- org.apache.pdfbox:xmpbox:jar:2.0.19:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.16:compile
[INFO] |  +- org.bouncycastle:bcmail-jdk15on:jar:1.64:compile
[INFO] |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.64:compile
[INFO] |  +- org.bouncycastle:bcprov-jdk15on:jar:1.64:compile
[INFO] |  +- org.apache.poi:poi:jar:4.1.2:compile
[INFO] |  |  \- com.zaxxer:SparseBitSet:jar:1.2:compile
[INFO] |  +- org.apache.poi:poi-scratchpad:jar:4.1.2:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:compile
[INFO] |  +- com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:compile
[INFO] |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] |  +- org.ow2.asm:asm:jar:7.3.1:compile
[INFO] |  +- com.googlecode.mp4parser:isoparser:jar:1.1.22:compile
[INFO] |  +- org.tallison:metadata-extractor:jar:2.13.0:compile
[INFO] |  |  \- org.tallison.xmp:xmpcore-shaded:jar:6.1.10:compile
[INFO] |  | \- com.adobe.xmp:xmpcore:jar:6.1.10:compile
[INFO] |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] |  +- com.rometools:rome:jar:1.12.2:compile
[INFO] |  |  \- com.rometools:rome-utils:jar:1.12.2:compile
[INFO] |  +- org.gagravarr:vorbis-java-core:jar:0.8:compile
[INFO] |  +- 
com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] |  +- org.codelibs:jhighlight:jar:1.0.3:compile
[INFO] |  +- com.pff:java-libpst:jar:0.9.3:compile
[INFO] |  +- com.github.junrar:junrar:jar:4.0.0:compile
[INFO] |  +- org.apache.cxf:cxf-rt-rs-client:jar:3.3.5:compile
[INFO] |  |  +- org.apache.cxf:cxf-rt-transports-http:jar:3.3.5:compile
[INFO] |  |  +- org.apache.cxf:cxf-core:jar:3.3.5:compile
[INFO] |  |  |  +- com.fasterxml.woodstox:woodstox-core:jar:5.0.3:compile
[INFO] |  |  |  \- org.apache.ws.xmlschema:xmlschema-core:jar:2.2.5:compile
[INFO] |  |  \- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.3.5:compile

Re: Migration to Spark 3.2

2022-01-26 Thread Stephen Coy
Hi Aurélien!

Please run

mvn dependency:tree

and check it for Jackson dependencies.

Feel free to respond with the output if you have any questions about it.

Cheers,

Steve C

> On 22 Jan 2022, at 10:49 am, Aurélien Mazoyer  wrote:
>
> Hello,
>
> I migrated my code to Spark 3.2 and I am facing some issues. When I run my 
> unit tests via Maven, I get this error:
> java.lang.NoClassDefFoundError: Could not initialize class 
> org.apache.spark.rdd.RDDOperationScope$
> which is not super nice.
>
> However, when I run my test via Intellij, I get the following one:
> java.lang.ExceptionInInitializerError
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
> at org.apache.spark.rdd.RDD.map(RDD.scala:421)
> ...
> Caused by: com.fasterxml.jackson.databind.JsonMappingException: Scala module 
> 2.12.3 requires Jackson Databind version >= 2.12.0 and < 2.13.0
> which is far better imo since it gives me some clue on what is missing in my 
> pom.xml file to make it work. After putting a few more dependencies, my tests 
> are again passing in intellij, but I am stuck on the same error when I am 
> running maven command :-/.
> It seems that jdk and maven versions are the same and both are using the same 
> .m2 directory.
> Any clue on what can be going wrong?
>
> Thank you,
>
> Aurelien





Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi Sean,

I have had a more detailed look at what Spark is doing with the log4j APIs, and at 
this point I suspect that a log4j 2.x migration might be more appropriate at the 
code level.

That still does not solve the libraries issue though. That would need more 
investigation.

I could be tempted to tackle it if there is enough interest.

Cheers,

Steve C

On 10 Nov 2021, at 9:42 am, Sean Owen <sro...@gmail.com> wrote:

Yep that's what I tried, roughly - there is an old jira about it. The issue is 
that Spark does need to configure some concrete logging framework in a few 
cases, as do other libs, and that isn't what the shims cover. Could be possible 
now or with more cleverness but the simple thing didn't work out IIRC.

On Tue, Nov 9, 2021, 4:32 PM Stephen Coy <s...@infomedia.com.au> wrote:
Hi there,

It’s true that the preponderance of log4j 1.2.x in many existing live projects 
is kind of a pain in the butt.

But there is a solution.

1. Migrate all Spark code to use slf4j APIs;

2. Exclude log4j 1.2.x from any dependencies sucking it in;

3. Include the log4j-over-slf4j bridge jar and slf4j-api jars;

4. Choose your favourite modern logging implementation and add it as a 
"runtime" dependency together with its slf4j binding jar (if needed).

In fact in the short term you can replace steps 1 and 2 with "remove the log4j 
1.2.17 jar from the distribution" and it should still work.

The slf4j project also includes a commons-logging shim for capturing its output 
too.

FWIW, the slf4j project is run by one of the original log4j developers.
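As a concrete illustration of step 1 (my own sketch, not Spark code): application code written against the slf4j API does not change regardless of which backend is bound at runtime.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Log through the slf4j facade only. Whether the output lands in log4j 1.x,
// log4j 2, logback or java.util.logging is decided by whichever binding and
// bridge jars are present on the runtime classpath.
public class Slf4jExample {
    private static final Logger LOG = LoggerFactory.getLogger(Slf4jExample.class);

    public static void main(String[] args) {
        LOG.info("Logging via the slf4j facade; backend chosen at deploy time");
        LOG.debug("Parameterised message, value={}", 42);
    }
}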

Cheers,

Steve C


On 9 Nov 2021, at 11:11 pm, Sean Owen <sro...@gmail.com> wrote:

No plans that I know of. It's not that Spark uses it so much as its 
dependencies. I tried and failed to upgrade it a couple years ago. you are 
welcome to try, and open a PR if successful.

On Tue, Nov 9, 2021 at 6:09 AM Ajay Kumar <ajay.praja...@gmail.com> wrote:
Hi Team,
We wanted to send Spark executor logs to a centralized logging server using a TCP 
socket. I see that the Spark log4j version is very old (1.2.17) and it does not 
support JSON logs over TCP sockets on containers.
I wanted to know what the plan is for upgrading the log4j version to log4j 2.
Thanks in advance.
Regards,
Ajay




Re: Log4J 2 Support

2021-11-09 Thread Stephen Coy
Hi there,

It’s true that the preponderance of log4j 1.2.x in many existing live projects 
is kind of a pain in the butt.

But there is a solution.

1. Migrate all Spark code to use slf4j APIs;

2. Exclude log4j 1.2.x from any dependencies sucking it in;

3. Include the log4j-over-slf4j bridge jar and slf4j-api jars;

4. Choose your favourite modern logging implementation and add it as a 
"runtime" dependency together with its slf4j binding jar (if needed).

In fact in the short term you can replace steps 1 and 2 with "remove the log4j 
1.2.17 jar from the distribution" and it should still work.

The slf4j project also includes a commons-logging shim for capturing its output 
too.

FWIW, the slf4j project is run by one of the original log4j developers.

Cheers,

Steve C


On 9 Nov 2021, at 11:11 pm, Sean Owen <sro...@gmail.com> wrote:

No plans that I know of. It's not that Spark uses it so much as its 
dependencies. I tried and failed to upgrade it a couple years ago. you are 
welcome to try, and open a PR if successful.

On Tue, Nov 9, 2021 at 6:09 AM Ajay Kumar <ajay.praja...@gmail.com> wrote:
Hi Team,
We wanted to send Spark executor logs to a centralized logging server using a TCP 
socket. I see that the Spark log4j version is very old (1.2.17) and it does not 
support JSON logs over TCP sockets on containers.
I wanted to know what the plan is for upgrading the log4j version to log4j 2.
Thanks in advance.
Regards,
Ajay



Re: Missing module spark-hadoop-cloud in Maven central

2021-06-01 Thread Stephen Coy
I have been building Apache Spark from source just so I can get this dependency.


  1.  git checkout v3.1.1
  2.  dev/make-distribution.sh --name hadoop-cloud-3.2 --tgz -Pyarn 
-Phadoop-3.2  -Pyarn -Phadoop-cloud -Phive-thriftserver  -Dhadoop.version=3.2.0

It is kind of a nuisance having to do this though.

Steve C


On 31 May 2021, at 10:34 pm, Sean Owen <sro...@gmail.com> wrote:

I know it's not enabled by default when the binary artifacts are built, but not 
exactly sure why it's not built separately at all. It's almost a 
dependencies-only pom artifact, but there are two source files. Steve do you 
have an angle on that?

On Mon, May 31, 2021 at 5:37 AM Erik Torres <etserr...@gmail.com> wrote:
Hi,

I'm following this 
documentation
 to configure my Spark-based application to interact with Amazon S3. However, I 
cannot find the spark-hadoop-cloud module in Maven central for the 
non-commercial distribution of Apache Spark. From the documentation I would 
expect that I can get this module as a Maven dependency in my project. However, 
I ended up building the spark-hadoop-cloud module from Spark's code.

Is this the expected way to setup the integration with Amazon S3? I think I'm 
missing something here.

Thanks in advance!

Erik



Re: pyspark sql load with path of special character

2021-04-25 Thread Stephen Coy
It probably does not like the colons in the path name “…20:04:27+00:00/…”, 
especially if you’re running on a Windows box.

On 24 Apr 2021, at 1:29 am, Regin Quinoa <sweatr...@gmail.com> wrote:

Hi, I am using pyspark sql to load files into table following
```LOAD DATA LOCAL INPATH '/user/hive/warehouse/students' OVERWRITE INTO TABLE 
test_load;```
 
https://spark.apache.org/docs/latest/sql-ref-syntax-dml-load.html

It complains pyspark.sql.utils.AnalysisException: load data input path does not 
exist
 when the path string has timestamp in the directory structure like
XX/XX/2021-03-02T20:04:27+00:00/file.parquet

It works with path without timestamp. How to work it around?




Re: How to make bucket listing faster while using S3 with wholeTextFile

2021-03-15 Thread Stephen Coy
Hi there,

At risk of stating the obvious, the first step is to ensure that your Spark 
application and S3 bucket are colocated in the same AWS region.

Steve C

On 16 Mar 2021, at 3:31 am, Alchemist <alchemistsrivast...@gmail.com> wrote:

How to optimize S3 file listing when using wholeTextFile(): We are using 
wholeTextFile to read data from S3. As per my understanding, wholeTextFile 
first lists the files at the given path. Since we are using S3 as the input 
source, listing the files in a bucket is single-threaded, and the S3 API for 
listing the keys in a bucket only returns keys in chunks of 1000 per call. 
Since we have millions of files, we are making thousands of API calls, and 
this listing makes our processing very slow. How can we make the S3 listing 
faster?

Thanks,

Rachana



Re: Scala vs Python for ETL with Spark

2020-10-10 Thread Stephen Boesch
I agree with Wim's assessment of data engineering / ETL vs Data Science.
I wrote pipelines/frameworks for large companies and scala was a much
better choice. But for ad-hoc work interfacing directly with data science
experiments pyspark presents less friction.

On Sat, 10 Oct 2020 at 13:03, Mich Talebzadeh 
wrote:

> Many thanks everyone for their valuable contribution.
>
> We all started with Spark a few years ago where Scala was the talk of the
> town. I agree with the note that as long as Spark stayed niche and elite,
> then someone with Scala knowledge was attracting premiums. In fairness in
> 2014-2015, there was not much talk of Data Science input (I may be wrong).
> But the world has moved on so to speak. Python itself has been around
> a long time (long being relative here). Most people either knew UNIX Shell,
> C, Python or Perl or a combination of all these. I recall we had a director
> a few years ago who asked our Hadoop admin for root password to log in to
> the edge node. Later he became head of machine learning somewhere else and
> he loved C and Python. So Python was a gift in disguise. I think Python
> appeals to those who are very familiar with CLI and shell programming (Not
> GUI fan). As some members alluded to there are more people around with
> Python knowledge. Most managers choose Python as the unifying development
> tool because they feel comfortable with it. Frankly I have not seen a
> manager who feels at home with Scala. So in summary it is a bit
> disappointing to abandon Scala and switch to Python just for the sake of it.
>
> Disclaimer: These are opinions and not facts so to speak :)
>
> Cheers,
>
>
> Mich
>
>
>
>
>
>
> On Fri, 9 Oct 2020 at 21:56, Mich Talebzadeh 
> wrote:
>
>> I have come across occasions when the teams use Python with Spark for
>> ETL, for example processing data from S3 buckets into Snowflake with Spark.
>>
>> The only reason I think they are choosing Python as opposed to Scala is
>> because they are more familiar with Python. Since Spark is written in
>> Scala, itself is an indication of why I think Scala has an edge.
>>
>> I have not done one to one comparison of Spark with Scala vs Spark with
>> Python. I understand for data science purposes most libraries like
>> TensorFlow etc. are written in Python but I am at loss to understand the
>> validity of using Python with Spark for ETL purposes.
>>
>> These are my understanding but they are not facts so I would like to get
>> some informed views on this if I can?
>>
>> Many thanks,
>>
>> Mich
>>
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Unsubscribe

2020-08-26 Thread Stephen Coy
The instructions for all Apache mail lists are in the mail headers:


List-Unsubscribe: 



On 27 Aug 2020, at 7:49 am, Jeff Evans <jeffrey.wayne.ev...@gmail.com> wrote:

That is not how you unsubscribe.  See here for instructions: 
https://gist.github.com/jeff303/ba1906bb7bcb2f2501528a8bb1521b8e

On Wed, Aug 26, 2020, 4:22 PM Annabel Melongo <melongo_anna...@yahoo.com.invalid> wrote:
Please remove me from the mailing list



Re: S3 read/write from PySpark

2020-08-11 Thread Stephen Coy
Hi there,

Also for the benefit of others, if you attempt to use any version of Hadoop > 
3.2.0 (such as 3.2.1), you will need to update the version of Google Guava used 
by Apache Spark to that consumed by Hadoop.

Hadoop 3.2.1 requires guava-27.0-jre.jar. The latest is guava-29.0-jre.jar 
which also works.

This guava update will resolve the java.lang.NoSuchMethodError issue.

Cheers,

Steve C

On 6 Aug 2020, at 6:06 pm, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

Thanks for your help. Problem solved, but I thought I should add something in 
case this problem is encountered by others.

Both responses are correct; BasicAWSCredentialsProvider is gone, but simply 
making the substitution leads to the traceback just below. 
java.lang.NoSuchMethodError: 'void 
com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
java.lang.Object, java.lang.Object)'
I found that changing all of the Hadoop packages from 3.3.0 to 3.2.0 *and* 
changing the options away from BasicAWSCredentialsProvider solved the problem.

Thanks.
Regards,



Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py",
 line 535, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/pyspark/sql/utils.py", 
line 131, in deco
return f(*a, **kw)
  File 
"/home/daniel/packages/spark-3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.csv.
: java.lang.NoSuchMethodError: 'void 
com.google.common.base.Preconditions.checkArgument(boolean, java.lang.String, 
java.lang.Object, java.lang.Object)'
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:893)
at org.apache.hadoop.fs.s3a.S3AUtils.lookupPassword(S3AUtils.java:869)
at 
org.apache.hadoop.fs.s3a.S3AUtils.getEncryptionAlgorithm(S3AUtils.java:1580)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:341)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)








On Thu, 6 Aug 2020 at 17:19, Stephen Coy <s...@infomedia.com.au> wrote:
Hi Daniel,

It looks like …BasicAWSCredentialsProvider has become 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.

However, the way that the username and password are provided appears to have 
changed, so you will probably need to look into that.

Cheers,

Steve C

On 6 Aug 2020, at 11:15 am, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

I am trying to read/write files to S3 from PySpark. The procedure that I have 
used is to download Spark, start PySpark with the hadoop-aws, guava, 
aws-java-sdk-bundle packages. The versions are explicitly specified by looking 
up the exact dependency versi

Re: S3 read/write from PySpark

2020-08-06 Thread Stephen Coy
Hi Daniel,

It looks like …BasicAWSCredentialsProvider has become 
org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider.

However, the way that the username and password are provided appears to have 
changed, so you will probably need to look into that.
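A minimal Java sketch of the sort of change involved (my own illustration; the bucket, path and keys are placeholders): point fs.s3a.aws.credentials.provider at the renamed class and keep supplying the key pair through the usual fs.s3a.* properties.

import org.apache.hadoop.conf.Configuration;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Configure the S3A connector with the provider class that replaced
// BasicAWSCredentialsProvider. Hard-coding keys is for illustration only;
// environment variables or instance profiles are preferable in practice.
public class S3aConfigExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("S3aConfigExample").getOrCreate();

        Configuration hadoopConf = spark.sparkContext().hadoopConfiguration();
        hadoopConf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider");
        hadoopConf.set("fs.s3a.access.key", "PLACEHOLDER_ACCESS_KEY");
        hadoopConf.set("fs.s3a.secret.key", "PLACEHOLDER_SECRET_KEY");

        Dataset<Row> df = spark.read().option("header", "true")
                .csv("s3a://some-bucket/some-file.csv");   // hypothetical location
        df.show();
        spark.stop();
    }
}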

Cheers,

Steve C

On 6 Aug 2020, at 11:15 am, Daniel Stojanov <m...@danielstojanov.com> wrote:

Hi,

I am trying to read/write files to S3 from PySpark. The procedure that I have 
used is to download Spark, start PySpark with the hadoop-aws, guava, 
aws-java-sdk-bundle packages. The versions are explicitly specified by looking 
up the exact dependency version on Maven. Allowing dependencies to be auto 
determined does not work. This procedure works for Spark 3.0.0 with Hadoop 2.7, 
but does not work for Hadoop 3.2. I get this exception:
java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider


Can somebody point to a procedure for doing this using Spark bundled with 
Hadoop 3.2?

Regards,





To replicate:

Download Spark 3.0.0 with support for Hadoop 3.2.

Launch spark with:

./pyspark --packages 
org.apache.hadoop:hadoop-common:3.3.0,org.apache.hadoop:hadoop-client:3.3.0,org.apache.hadoop:hadoop-aws:3.3.0,com.amazonaws:aws-java-sdk-bundle:1.11.563,com.google.guava:guava:27.1-jre

Then run the following Python.

# Set these 4 parameters as appropriate.
aws_bucket = ""
aws_filename = ""
aws_access_key = ""
aws_secret_key = ""

spark._jsc.hadoopConfiguration().set("fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", 
"true")
spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark._jsc.hadoopConfiguration().set("fs.s3a.endpoint", 
"s3.amazonaws.com")
spark._jsc.hadoopConfiguration().set("fs.s3a.access.key", aws_access_key)
spark._jsc.hadoopConfiguration().set("fs.s3a.secret.key", aws_secret_key)
df = spark.read.option("header", 
"true").csv(f"s3a://{aws_bucket}/{aws_filename}")



Leads to this error message:



Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/pyspark/sql/readwriter.py",
 line 535, in csv
return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1305, in __call__
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/pyspark/sql/utils.py",
 line 131, in deco
return f(*a, **kw)
  File 
"/home/daniel/.local/share/Trash/files/spark-3.3.0.0-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o39.csv.
: java.io.IOException: From option fs.s3a.aws.credentials.provider 
java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider not found
at 
org.apache.hadoop.fs.s3a.S3AUtils.loadAWSProviderClasses(S3AUtils.java:645)
at 
org.apache.hadoop.fs.s3a.S3AUtils.buildAWSProviderList(S3AUtils.java:668)
at 
org.apache.hadoop.fs.s3a.S3AUtils.createAWSCredentialProviderSet(S3AUtils.java:619)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.bindAWSClient(S3AFileSystem.java:636)
at 
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:390)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
at 
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:46)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:361)
at 
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:279)
at 
org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:268)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:268)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:705)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

Re: Tab delimited csv import and empty columns

2020-08-05 Thread Stephen Coy
Hi Sean, German and others,

Setting the “nullValue” option (for parsing CSV at least) seems to be an 
exercise in futility.

When parsing the file, 
com.univocity.parsers.common.input.AbstractCharInputReader#getString contains 
the following logic:


String out;
if (len <= 0) {
   out = nullValue;
} else {
   out = new String(buffer, pos, len);
}

resulting in the nullValue being assigned to the column value if it has zero 
length, such as with an empty String.

Later, org.apache.spark.sql.catalyst.csv.UnivocityParser#nullSafeDatum is 
called on the column value:


if (datum == options.nullValue || datum == null) {
  if (!nullable) {
throw new RuntimeException(s"null value found but field $name is not 
nullable.")
  }
  null
} else {
  converter.apply(datum)
}

Therefore, the empty String is first converted to the nullValue, and then 
matched against the nullValue and, bingo, we get the literal null.

For now, the “.na.fill(“”)” addition to the code is doing the right thing for 
me.
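For anyone else who hits this, the workaround looks like the following in Java (a sketch based on the snippet earlier in this thread; in the Java API the call is .na().fill("")):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TabDelimitedRead {
    // Read the tab-delimited file as before, then convert the nulls that the
    // CSV parser produced for empty fields back into literal empty strings.
    public static Dataset<Row> readCatalog(SparkSession sparkSession, String path) {
        return sparkSession
            .read()
            .option("sep", "\t")
            .option("header", "true")
            .csv(path)
            .na().fill("")
            .cache();
    }
}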

Thanks for all the help.


Steve C


On 1 Aug 2020, at 1:40 am, Sean Owen <sro...@gmail.com> wrote:

Try setting nullValue to anything besides the empty string. Because its default 
is the empty string, empty strings become null by default.

On Fri, Jul 31, 2020 at 3:20 AM Stephen Coy <s...@infomedia.com.au.invalid> wrote:
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests and there does not seem to be any 
that load a CSV text file and verify that an empty string maps to an empty 
string, which I think is supposed to be the default behaviour because the 
"nullValue" option defaults to "".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that your empty values in your CSV are "" , if so, try this option:

.option("emptyValue", "\"\"")

Hope it helps

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi there,

I’m trying to import a tab delimited file with:

Dataset catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C






Re: Tab delimited csv import and empty columns

2020-07-31 Thread Stephen Coy
That does not work.

This is Spark 3.0 by the way.

I have been looking at the Spark unit tests and there does not seem to be any 
that load a CSV text file and verify that an empty string maps to an empty 
string, which I think is supposed to be the default behaviour because the 
"nullValue" option defaults to "".

Thanks anyway

Steve C

On 30 Jul 2020, at 10:01 pm, German Schiavon Matteo <gschiavonsp...@gmail.com> wrote:

Hey,

I understand that your empty values in your CSV are "" , if so, try this option:

.option("emptyValue", "\"\"")

Hope it helps

On Thu, 30 Jul 2020 at 08:49, Stephen Coy <s...@infomedia.com.au.invalid> wrote:
Hi there,

I’m trying to import a tab delimited file with:

Dataset catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C





Tab delimited csv import and empty columns

2020-07-30 Thread Stephen Coy
Hi there,

I’m trying to import a tab delimited file with:

Dataset catalogData = sparkSession
  .read()
  .option("sep", "\t")
  .option("header", "true")
  .csv(args[0])
  .cache();

This works great, except for the fact that any column that is empty is given 
the value null, when I need these values to be literal empty strings.

Is there any option combination that will achieve this?

Thanks,

Steve C




Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I just looked at the examples.
https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
These look v nice! V concise yet flexible. I like the ability to do
inline *side-effects*, e.g. caching or printing or showDs().

package org.jetbrains.spark.api.examples
import org.apache.spark.sql.Row
import org.jetbrains.spark.api.*

fun main() {
withSpark {
val sd = dsOf(1, 2, 3)
sd.createOrReplaceTempView("ds")
spark.sql("select * from ds")
.withCached {
println("asList: ${toList()}")
println("asArray: ${toArray().contentToString()}")
this
}
.to()
.withCached {
println("typed collect: " + (collect() as
Array).contentToString())
println("type collectAsList: " + collectAsList())
}

dsOf(1, 2, 3)
.map { c(it, it + 1, it + 2) }
.to()
.select("_1")
.collectAsList()
.forEach { println(it) }
}
}


So that shows some of the niceness of Kotlin: intuitive type conversion
(`to`/`to`) and `dsOf(list)`, and also the inlining of the side
effects. Overall concise and pleasant to read.


On Tue, 14 Jul 2020 at 12:18, Stephen Boesch  wrote:

> I started with scala/spark in 2012 and scala has been my go-to language
> for six years. But I heartily applaud this direction. Kotlin is more like a
> simplified Scala - with the benefits that brings - than a simplified java.
> I particularly like the simplified / streamlined collections classes.
>
> Really looking forward to this development.
>
> On Tue, 14 Jul 2020 at 10:42, Maria Khalusova  wrote:
>
>> Hi folks,
>>
>> We would love your feedback on the new Kotlin Spark API that we are
>> working on: https://github.com/JetBrains/kotlin-spark-api.
>>
>> Why Kotlin Spark API? Kotlin developers can already use Kotlin with the
>> existing Apache Spark Java API, however they cannot take full advantage of
>> Kotlin language features. With Kotlin Spark API, you can use Kotlin data
>> classes and lambda expressions.
>>
>> The API also adds some helpful extension functions. For example, you can
>> use `withCached` to perform arbitrary transformations on a Dataset and not
>> worry about the Dataset unpersisting at the end.
>>
>> If you like Kotlin and would like to try the API, we've prepared a Quick
>> Start Guide to help you set up all the needed dependencies in no time using
>> either Maven or Gradle:
>> https://github.com/JetBrains/kotlin-spark-api/blob/master/docs/quick-start-guide.md
>>
>> In the repo, you’ll also find a few code examples to get an idea of what
>> the API looks like:
>> https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
>>
>> We’d love to see your feedback in the project’s GitHub issues:
>> https://github.com/JetBrains/kotlin-spark-api/issues.
>>
>>
>> Thanks!
>>
>>
>>


Re: Kotlin Spark API

2020-07-14 Thread Stephen Boesch
I started with scala/spark in 2012 and scala has been my go-to language for
six years. But I heartily applaud this direction. Kotlin is more like a
simplified Scala - with the benefits that brings - than a simplified java.
I particularly like the simplified / streamlined collections classes.

Really looking forward to this development.

On Tue, 14 Jul 2020 at 10:42, Maria Khalusova  wrote:

> Hi folks,
>
> We would love your feedback on the new Kotlin Spark API that we are
> working on: https://github.com/JetBrains/kotlin-spark-api.
>
> Why Kotlin Spark API? Kotlin developers can already use Kotlin with the
> existing Apache Spark Java API, however they cannot take full advantage of
> Kotlin language features. With Kotlin Spark API, you can use Kotlin data
> classes and lambda expressions.
>
> The API also adds some helpful extension functions. For example, you can
> use `withCached` to perform arbitrary transformations on a Dataset and not
> worry about the Dataset unpersisting at the end.
>
> If you like Kotlin and would like to try the API, we've prepared a Quick
> Start Guide to help you set up all the needed dependencies in no time using
> either Maven or Gradle:
> https://github.com/JetBrains/kotlin-spark-api/blob/master/docs/quick-start-guide.md
>
> In the repo, you’ll also find a few code examples to get an idea of what
> the API looks like:
> https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/spark/api/examples
>
> We’d love to see your feedback in the project’s GitHub issues:
> https://github.com/JetBrains/kotlin-spark-api/issues.
>
>
> Thanks!
>
>
>


When does SparkContext.defaultParallelism have the correct value?

2020-07-06 Thread Stephen Coy
Hi there,

I have found that if I invoke

sparkContext.defaultParallelism()

too early it will not return the correct value;

For example, if I write this:

final JavaSparkContext sparkContext = new 
JavaSparkContext(sparkSession.sparkContext());
final int workerCount = sparkContext.defaultParallelism();

I will get some small number (which I can’t recall right now).

However, if I insert:

sparkContext.parallelize(List.of(1, 2, 3, 4)).collect()

between these two lines I get the expected value being something like 
node_count * node_core_count;

This seems like a hacky work around solution to me. Is there a better way to 
get this value initialised properly?

FWIW, I need this value to size a connection pool (fs.s3a.connection.maximum) 
correctly in a cluster independent way.
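
An alternative sketch, assuming the job is submitted with explicit executor 
settings rather than dynamic allocation, would be to derive the figure from the 
submitted configuration instead of waiting for executors to register:

import org.apache.spark.SparkConf;

// Both configuration keys are assumptions about how the job is launched;
// dynamic allocation or a local[*] master would need different handling.
SparkConf conf = sparkContext.getConf();
int executorCores = conf.getInt("spark.executor.cores", 1);
int executorCount = conf.getInt("spark.executor.instances", 1);
int workerCount = executorCores * executorCount;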

Thanks,

Steve C




Re: java.lang.ClassNotFoundException for s3a comitter

2020-07-06 Thread Stephen Coy
Hi Steve,

While I understand your point regarding the mixing of Hadoop jars, this does 
not address the java.lang.ClassNotFoundException.

Prebuilt Apache Spark 3.0 builds are only available for Hadoop 2.7 or Hadoop 
3.2. Not Hadoop 3.1.

The only place that I have found that missing class is in the Spark 
“hadoop-cloud” source module, and currently the only way to get the jar 
containing it is to build it yourself. If any of the devs are listening it  
would be nice if this was included in the standard distribution. It has a 
sizeable chunk of a repackaged Jetty embedded in it which I find a bit odd.

But I am relatively new to this stuff so I could be wrong.

I am currently running Spark 3.0 clusters with no HDFS. Spark is set up like:

hadoopConfiguration.set("spark.hadoop.fs.s3a.committer.name", "directory");
hadoopConfiguration.set("spark.sql.sources.commitProtocolClass", 
"org.apache.spark.internal.io.cloud.PathOutputCommitProtocol");
hadoopConfiguration.set("spark.sql.parquet.output.committer.class", 
"org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter");
hadoopConfiguration.set("fs.s3a.connection.maximum", Integer.toString(coreCount 
* 2));

Querying and updating s3a data sources seems to be working ok.

Thanks,

Steve C

On 29 Jun 2020, at 10:34 pm, Steve Loughran 
mailto:ste...@cloudera.com.INVALID>> wrote:

you are going to need hadoop-3.1 on your classpath, with hadoop-aws and the 
same aws-sdk it was built with (1.11.something). Mixing hadoop JARs is doomed. 
using a different aws sdk jar is a bit risky, though more recent upgrades have 
all be fairly low stress

On Fri, 19 Jun 2020 at 05:39, murat migdisoglu 
mailto:murat.migdiso...@gmail.com>> wrote:
Hi all
I've upgraded my test cluster to spark 3 and change my comitter to directory 
and I still get this error.. The documentations are somehow obscure on that.
Do I need to add a third party jar to support new comitters?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu 
mailto:murat.migdiso...@gmail.com>> wrote:
Hello all,
we have a hadoop cluster (using yarn) using  s3 as filesystem with s3guard is 
enabled.
We are using hadoop 3.2.1 with spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant spark configurations are as following:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While spark streaming fails with the exception above, apache beam succeeds 
writing parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare




Re: RDD-like API for entirely local workflows?

2020-07-04 Thread Stephen Boesch
Spark in local mode (which is different than standalone) is a solution for
many use cases. I use it in conjunction with (and sometimes instead of)
pandas/pandasql due to its much wider ETL related capabilities. On the JVM
side it is an even more obvious choice - given there is no equivalent to
pandas and it has even better performance.

It is also a strong candidate due to the expressiveness of the sql dialect
including support for analytical/windowing functions.There is a latency
hit: on the order of a couple of seconds to start the SparkContext - but
pandas is not a high performance tool in any case.

i see that OpenRefine is implemented in Java so then Spark local should  be
a very good complement to it.
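
For reference, a minimal local-mode session from Java looks roughly like this
(the app name and input path are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Single-JVM session using every local core; no cluster or HDFS required.
SparkSession spark = SparkSession.builder()
    .master("local[*]")
    .appName("local-etl-example")      // placeholder name
    .getOrCreate();

Dataset<Row> df = spark.read()
    .option("header", "true")
    .csv("/tmp/input.csv");            // placeholder input
df.show();
spark.stop();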


On Sat, 4 Jul 2020 at 08:17, Antonin Delpeuch (lists) <
li...@antonin.delpeuch.eu> wrote:

> Hi,
>
> I am working on revamping the architecture of OpenRefine, an ETL tool,
> to execute workflows on datasets which do not fit in RAM.
>
> Spark's RDD API is a great fit for the tool's operations, and provides
> everything we need: partitioning and lazy evaluation.
>
> However, OpenRefine is a lightweight tool that runs locally, on the
> users' machine, and we want to preserve this use case. Running Spark in
> standalone mode works, but I have read at a couple of places that the
> standalone mode is only intended for development and testing. This is
> confirmed by my experience with it so far:
> - the overhead added by task serialization and scheduling is significant
> even in standalone mode. This makes sense for testing, since you want to
> test serialization as well, but to run Spark in production locally, we
> would need to bypass serialization, which is not possible as far as I know;
> - some bugs that manifest themselves only in local mode are not getting
> a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
> it seems dangerous to base a production system on standalone Spark.
>
> So, we cannot use Spark as default runner in the tool. Do you know any
> alternative which would be designed for local use? A library which would
> provide something similar to the RDD API, but for parallelization with
> threads in the same JVM, not machines in a cluster?
>
> If there is no such thing, it should not be too hard to write our
> homegrown implementation, which would basically be Java streams with
> partitioning. I have looked at Apache Beam's direct runner, but it is
> also designed for testing so does not fit our bill for the same reasons.
>
> We plan to offer a Spark-based runner in any case - but I do not think
> it can be used as the default runner.
>
> Cheers,
> Antonin
>
>
>
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


unsubscribe

2020-06-28 Thread stephen



Re: Hey good looking toPandas ()

2020-06-19 Thread Stephen Boesch
afaik It has been there since  Spark 2.0 in 2015.   Not certain about Spark
1.5/1.6

On Thu, 18 Jun 2020 at 23:56, Anwar AliKhan 
wrote:

> I first ran the  command
> df.show()
>
> For sanity check of my dataFrame.
>
> I wasn't impressed with the display.
>
> I then ran
> df.toPandas() in Jupiter Notebook.
>
> Now the display is really good looking .
>
> Is toPandas() a new function which became available in Spark 3.0 ?
>
>
>
>
>
>


Re: java.lang.ClassNotFoundException for s3a comitter

2020-06-18 Thread Stephen Coy
Hi Murat Migdisoglu,

Unfortunately you need the secret sauce to resolve this.

It is necessary to check out the Apache Spark source code and build it with the 
right command line options. This is what I have been using:

dev/make-distribution.sh --name my-spark --tgz -Pyarn -Phadoop-3.2  -Pyarn 
-Phadoop-cloud -Dhadoop.version=3.2.1

This will add additional jars into the build.

Copy hadoop-aws-3.2.1.jar, hadoop-openstack-3.2.1.jar and 
spark-hadoop-cloud_2.12-3.0.0.jar into the “jars” directory of your Spark 
distribution. If you are paranoid you could copy/replace all the 
hadoop-*-3.2.1.jar files but I have not found that necessary.

You will also need to upgrade the version of guava that appears in the spark 
distro because Hadoop 3.2.1 bumped this from guava-14.0.1.jar to 
guava-27.0-jre.jar. Otherwise you will get runtime ClassNotFound exceptions.

I have been using this combo for many months now with the Spark 3.0 
pre-releases and it has been working great.

Cheers,

Steve C


On 19 Jun 2020, at 10:24 am, murat migdisoglu 
mailto:murat.migdiso...@gmail.com>> wrote:

Hi all
I've upgraded my test cluster to spark 3 and change my comitter to directory 
and I still get this error.. The documentations are somehow obscure on that.
Do I need to add a third party jar to support new comitters?

java.lang.ClassNotFoundException: 
org.apache.spark.internal.io.cloud.PathOutputCommitProtocol


On Thu, Jun 18, 2020 at 1:35 AM murat migdisoglu 
mailto:murat.migdiso...@gmail.com>> wrote:
Hello all,
we have a hadoop cluster (using yarn) using  s3 as filesystem with s3guard is 
enabled.
We are using hadoop 3.2.1 with spark 2.4.5.

When I try to save a dataframe in parquet format, I get the following exception:
java.lang.ClassNotFoundException: 
com.hortonworks.spark.cloud.commit.PathOutputCommitProtocol

My relevant spark configurations are as following:
"hadoop.mapreduce.outputcommitter.factory.scheme.s3a":"org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
"fs.s3a.committer.name":
 "magic",
"fs.s3a.committer.magic.enabled": true,
"fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",

While spark streaming fails with the exception above, apache beam succeeds 
writing parquet files.
What might be the problem?

Thanks in advance


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare


--
"Talkers aren’t good doers. Rest assured that we’re going there to use our 
hands, not our tongues."
W. Shakespeare



Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
I neglected to include the rationale: the assumption is this will be a
repeatedly needed process thus a reusable method were helpful.  The
predicate/input rules that are supported will need to be flexible enough to
support the range of input data domains and use cases .  For my workflows
the predicates are typically sql's.

Am Sa., 2. Mai 2020 um 06:13 Uhr schrieb Stephen Boesch :

> Hi Mich!
>I think you can combine the good/rejected into one method that
> internally:
>
>- Create good/rejected df's given an input df and input
>rules/predicates to apply to the df.
>- Create a third df containing the good rows and the rejected rows
>with the bad columns nulled out
>- Append/insert the two dfs into their respective hive good/exception
>tables
>- return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
>or maybe just the (combinedDf,exceptionsDf)
>
>
> Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
> mich.talebza...@gmail.com>:
>
>>
>> Hi,
>>
>> I have a Spark Scala program created and compiled with Maven. It works
>> fine. It basically does the following:
>>
>>
>>1. Reads an xml file from HDFS location
>>2. Creates a DF on top of what it reads
>>3. Creates a new DF with some columns renamed etc
>>4. Creates a new DF for rejected rows (incorrect value for a column)
>>5. Puts rejected data into Hive exception table
>>6. Puts valid rows into Hive main table
>>7. Nullifies the invalid rows by setting the invalid column to NULL
>>and puts the rows into the main Hive table
>>
>> These are currently performed in one method. Ideally I want to break this
>> down as follows:
>>
>>
>>1. A method to read the XML file and creates DF and a new DF on top
>>of previous DF
>>2. A method to create a DF on top of rejected rows using t
>>3. A method to put invalid rows into the exception table using tmp
>>table
>>4. A method to put the correct rows into the main table again using
>>tmp table
>>
>> I was wondering if this is correct approach?
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Modularising Spark/Scala program

2020-05-02 Thread Stephen Boesch
Hi Mich!
   I think you can combine the good/rejected into one method that
internally:

   - Create good/rejected df's given an input df and input rules/predicates
   to apply to the df.
   - Create a third df containing the good rows and the rejected rows with
   the bad columns nulled out
   - Append/insert the two dfs into their respective hive good/exception
   tables
   - return value can be a tuple of the (goodDf,exceptionsDf,combinedDf)
   or maybe just the (combinedDf,exceptionsDf)


Am Sa., 2. Mai 2020 um 06:00 Uhr schrieb Mich Talebzadeh <
mich.talebza...@gmail.com>:

>
> Hi,
>
> I have a Spark Scala program created and compiled with Maven. It works
> fine. It basically does the following:
>
>
>1. Reads an xml file from HDFS location
>2. Creates a DF on top of what it reads
>3. Creates a new DF with some columns renamed etc
>4. Creates a new DF for rejected rows (incorrect value for a column)
>5. Puts rejected data into Hive exception table
>6. Puts valid rows into Hive main table
>7. Nullifies the invalid rows by setting the invalid column to NULL
>and puts the rows into the main Hive table
>
> These are currently performed in one method. Ideally I want to break this
> down as follows:
>
>
>1. A method to read the XML file and creates DF and a new DF on top of
>previous DF
>2. A method to create a DF on top of rejected rows using t
>3. A method to put invalid rows into the exception table using tmp
>table
>4. A method to put the correct rows into the main table again using
>tmp table
>
> I was wondering if this is correct approach?
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Going it alone.

2020-04-16 Thread Stephen Boesch
The warning signs were there from the first email sent from that person. I
wonder is there any way to deal with this more proactively.

Am Do., 16. Apr. 2020 um 10:54 Uhr schrieb Mich Talebzadeh <
mich.talebza...@gmail.com>:

> good for you. right move
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Thu, 16 Apr 2020 at 18:33, Sean Owen  wrote:
>
>> Absolutely unacceptable even if this were the only one. I'm contacting
>> INFRA right now.
>>
>> On Thu, Apr 16, 2020 at 11:57 AM Holden Karau 
>> wrote:
>>
>>> I want to be clear I believe the language in janethrope1s email is
>>> unacceptable for the mailing list and possibly a violation of the Apache
>>> code of conduct. I’m glad we don’t see messages like this often.
>>>
>>> I know this is a stressful time for many of us, but let’s try and do our
>>> best to not take it out on others.
>>>
>>> On Wed, Apr 15, 2020 at 11:46 PM Subash Prabakar <
>>> subashpraba...@gmail.com> wrote:
>>>
 Looks like he had a very bad appraisal this year.. Fun fact : the
 coming year would be too :)

 On Thu, 16 Apr 2020 at 12:07, Qi Kang  wrote:

> Well man, check your attitude, you’re way over the line
>
>
> On Apr 16, 2020, at 13:26, jane thorpe 
> wrote:
>
> F*U*C*K O*F*F
> C*U*N*T*S
>
>
> --
> On Thursday, 16 April 2020 Kelvin Qin  wrote:
>
> No wonder I said why I can't understand what the mail expresses, it
> feels like a joke……
>
>
>
>
>
> 在 2020-04-16 02:28:49,seemanto.ba...@nomura.com.INVALID 写道:
>
> Have we been tricked by a bot ?
>
>
> *From:* Matt Smith 
> *Sent:* Wednesday, April 15, 2020 2:23 PM
> *To:* jane thorpe
> *Cc:* dh.lo...@gmail.com; user@spark.apache.org; janethor...@aol.com;
> em...@yeikel.com
> *Subject:* Re: Going it alone.
>
>
> *CAUTION EXTERNAL EMAIL:* DO NOT CLICK ON LINKS OR OPEN ATTACHMENTS
> THAT ARE UNEXPECTED OR SENT FROM UNKNOWN SENDERS. IF IN DOUBT REPORT TO
> SPAM SUBMISSIONS.
>
> This is so entertaining.
>
>
> 1. Ask for help
>
> 2. Compare those you need help from to a lower order primate.
>
> 3. Claim you provided information you did not
>
> 4. Explain that providing any information would be "too revealing"
>
> 5. ???
>
>
> Can't wait to hear what comes next, but please keep it up.  This is a
> bright spot in my day.
>
>
>
> On Tue, Apr 14, 2020 at 4:47 PM jane thorpe <
> janethor...@aol.com.invalid> wrote:
>
> I did write a long email in response to you.
> But then I deleted it because I felt it would be too revealing.
>
>
>
>
> --
>
> On Tuesday, 14 April 2020 David Hesson  wrote:
>
> I want to know  if Spark is headed in my direction.
>
> You are implying  Spark could be.
>
>
>
> What direction are you headed in, exactly? I don't feel as if anything
> were implied when you were asked for use cases or what problem you are
> solving. You were asked to identify some use cases, of which you don't
> appear to have any.
>
>
> On Tue, Apr 14, 2020 at 4:49 PM jane thorpe <
> janethor...@aol.com.invalid> wrote:
>
> That's what  I want to know,  Use Cases.
> I am looking for  direction as I described and I want to know  if
> Spark is headed in my direction.
>
> You are implying  Spark could be.
>
> So tell me about the USE CASES and I'll do the rest.
> --
>
> On Tuesday, 14 April 2020 yeikel valdes  wrote:
>
> It depends on your use case. What are you trying to solve?
>
>
>
>  On Tue, 14 Apr 2020 15:36:50 -0400 *janethor...@aol.com.INVALID
>  *wrote 
>
> Hi,
>
> I consider myself to be quite good in Software Development especially
> using frameworks.
>
> I like to get my hands  dirty. I have spent the last few months
> understanding modern frameworks and architectures.
>
> I am looking to invest my energy in a product where I don't have to
> relying on the monkeys which occupy this space  we call software
> development.
>
> I have found one that meets my requirements.
>
> Would Apache Spark be a good Tool for me or  do I 

Re: IDE suitable for Spark

2020-04-07 Thread Stephen Boesch
I have been using  Idea for both scala/spark and pyspark projects since
2013. It required fair amount of fiddling that first year but has been
stable since early 2015.   For pyspark projects only Pycharm naturally also
works v well.

Am Di., 7. Apr. 2020 um 09:10 Uhr schrieb yeikel valdes :

>
> Zeppelin is not an IDE but a notebook.  It is helpful to experiment but it
> is missing a lot of the features that we expect from an IDE.
>
> Thanks for sharing though.
>
>  On Tue, 07 Apr 2020 04:45:33 -0400 * zahidr1...@gmail.com
>  * wrote 
>
> When I first logged on I asked if there was a suitable IDE for Spark.
> I did get a couple of responses.
> *Thanks.*
>
> I did actually find one which is suitable IDE for spark.
> That is  *Apache Zeppelin.*
>
> One of many reasons it is suitable for Apache Spark is.
> The  *up and running Stage* which involves typing *bin/zeppelin-daemon.sh
> start*
> Go to browser and type *http://localhost:8080 *
> That's it!
>
> Then to
> * Hit the ground running*
> There are also ready to go Apache Spark examples
> showing off the type of functionality one will be using in real life
> production.
>
> Zeppelin comes with  embedded Apache Spark  and scala as default
> interpreter with 20 + interpreters.
> I have gone on to discover there are a number of other advantages for real
> time production
> environment with Zeppelin offered up by other Apache Products.
>
> Backbutton.co.uk
> ¯\_(ツ)_/¯
> ♡۶Java♡۶RMI ♡۶
> Make Use Method {MUM}
> makeuse.org
> 
>
>
>


Re: [PySpark] How to write HFiles as an 'append' to the same directory?

2020-03-16 Thread Stephen Coy
I encountered a similar problem when trying to:

ds.write().save(“s3a://some-bucket/some/path/table”);

which writes the content as a bunch of parquet files in the “folder” named 
“table”.

I am using a Flintrock cluster with the Spark 3.0 preview FWIW.

Anyway, I just used the AWS SDK to remove it (and any “subdirectories”) before 
kicking off the spark machinery.

I can show you how to do this in Java, but I think the Python SDK may be 
significantly different.
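
Roughly, the Java version looks like the sketch below (the bucket and prefix 
are the placeholders from above, and pagination handling is kept minimal):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ObjectListing;
import com.amazonaws.services.s3.model.S3ObjectSummary;

// Remove every object under the "folder" before Spark writes to it.
AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
ObjectListing listing = s3.listObjects("some-bucket", "some/path/table");
while (true) {
    for (S3ObjectSummary summary : listing.getObjectSummaries()) {
        s3.deleteObject("some-bucket", summary.getKey());
    }
    if (!listing.isTruncated()) {
        break;
    }
    listing = s3.listNextBatchOfObjects(listing);
}

Depending on the use case, ds.write().mode(SaveMode.Overwrite).save(...) may 
avoid the need for the extra SDK call.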

Steve C


On 15 Mar 2020, at 6:23 am, Gautham Acharya 
mailto:gauth...@alleninstitute.org>> wrote:

I have a process in Apache Spark that attempts to write HFiles to S3 in a 
batched process. I want the resulting HFiles in the same directory, as they are 
in the same column family. However, I’m getting a ‘directory already exists 
error’ when I try to run this on AWS EMR. How can I write Hfiles via Spark as 
an ‘append’, like I can do via a CSV?

The batch writing function looks like this:

for col_group in split_cols:
processed_chunk = 
batch_write_pandas_udf_for_col_aggregation(joined_dataframe, col_group, 
pandas_udf_func, group_by_args)

hfile_writer.write_hfiles(processed_chunk, output_path,
  zookeeper_ip, table_name, 
constants.DEFAULT_COL_FAMILY)

The actual function to write the Hfiles is this:

rdd.saveAsNewAPIHadoopFile(output_path,
   constants.OUTPUT_FORMAT_CLASS,
   keyClass=constants.KEY_CLASS,
   valueClass=constants.VALUE_CLASS,
   keyConverter=constants.KEY_CONVERTER,
   valueConverter=constants.VALUE_CONVERTER,
   conf=conf)


The exception I’m getting:


Called with arguments: Namespace(job_args=['matrix_path=/tmp/matrix.csv', 
'metadata_path=/tmp/metadata.csv', 
'output_path=s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles',
 'group_by_args=cluster_id', 'zookeeper_ip=ip-172-30-5-36.ec2.internal', 
'table_name=test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a'], 
job_name='matrix_transformations')

job_args_tuples: [['matrix_path', '/tmp/matrix.csv'], ['metadata_path', 
'/tmp/metadata.csv'], ['output_path', 
's3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles'],
 ['group_by_args', 'cluster_id'], ['zookeeper_ip', 
'ip-172-30-5-36.ec2.internal'], ['table_name', 
'test_wide_matrix__8c6b2f09-2ba0-42fc-af8e-656188c0186a']]

Traceback (most recent call last):

  File "/mnt/var/lib/hadoop/steps/s-2ZIOR335HH9TR/main.py", line 56, in 

job_module.transform(spark, **job_args)

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 93, in transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/jobs/matrix_transformations/job.py",
 line 73, in write_split_columnwise_transform

  File 
"/mnt/tmp/spark-745d355c-0a90-4992-a07c-bc1dde4fac3e/userFiles-33d65b1c-bdae-4b44-997e-4d4c7521ad96/dep.zip/src/spark_transforms/pyspark_jobs/output_handler/hfile_writer.py",
 line 44, in write_hfiles

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1438, in 
saveAsNewAPIHadoopFile

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
line 1257, in __call__

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
in deco

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.saveAsNewAPIHadoopFile.

: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory 
s3://gauthama-etl-pipelines-test-working-bucket/8c6b2f09-2ba0-42fc-af8e-656188c0186a/hfiles/median
 already exists

at 
org.apache.hadoop.mapreduce.lib.output.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:146)

at 
org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.assertConf(SparkHadoopWriter.scala:393)

at 
org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:71)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1083)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1081)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)

at 

Re: FYI: The evolution on `CHAR` type behavior

2020-03-16 Thread Stephen Coy
Hi there,

I’m kind of new around here, but I have had experience with all of the so 
called “big iron” databases such as Oracle, IBM DB2 and Microsoft SQL Server as 
well as Postgresql.

They all support the notion of “ANSI padding” for CHAR columns - which means 
that such columns are always space padded, and they default to having this 
enabled (for ANSI compliance).

MySQL also supports it, but it defaults to leaving it disabled for historical 
reasons not unlike what we have here.

In my opinion we should push toward standards compliance where possible and 
then document where it cannot work.

If users don’t like the padding on CHAR columns then they should change to 
VARCHAR - I believe that was its purpose in the first place, and it does not 
dictate any sort of “padding".

I can see why you might “ban” the use of CHAR columns where they cannot be 
consistently supported, but VARCHAR is a different animal and I would expect it 
to work consistently everywhere.


Cheers,

Steve C

On 17 Mar 2020, at 10:01 am, Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:

Hi, Reynold.
(And +Michael Armbrust)

If you think so, do you think it's okay that we change the return value 
silently? Then, I'm wondering why we reverted `TRIM` functions then?

> Are we sure "not padding" is "incorrect"?

Bests,
Dongjoon.


On Sun, Mar 15, 2020 at 11:15 PM Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>> wrote:
Hi,

100% agree with Reynold.


Regards,
Gourav Sengupta

On Mon, Mar 16, 2020 at 3:31 AM Reynold Xin 
mailto:r...@databricks.com>> wrote:

Are we sure "not padding" is "incorrect"?

I don't know whether ANSI SQL actually requires padding, but plenty of 
databases don't actually pad.

https://docs.snowflake.net/manuals/sql-reference/data-types-text.html
 : "Snowflake currently deviates from common CHAR semantics in that strings 
shorter than the maximum length are not space-padded at the end."

MySQL: 
https://stackoverflow.com/questions/53528645/why-char-dont-have-padding-in-mysql








On Sun, Mar 15, 2020 at 7:02 PM, Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, Reynold.

Please see the following for the context.

https://issues.apache.org/jira/browse/SPARK-31136
"Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax"

I raised the above issue according to the new rubric, and the banning was the 
proposed alternative to reduce the potential issue.

Please give us your opinion since it's still PR.

Bests,
Dongjoon.

On Sat, Mar 14, 2020 at 17:54 Reynold Xin 
mailto:r...@databricks.com>> wrote:
I don’t understand this change. Wouldn’t this “ban” confuse the hell out of 
both new and old users?

For old users, their old code that was working for char(3) would now stop 
working.

For new users, depending on whether the underlying metastore char(3) is either 
supported but different from ansi Sql (which is not that big of a deal if we 
explain it) or not supported.

On Sat, Mar 14, 2020 at 3:51 PM Dongjoon Hyun 
mailto:dongjoon.h...@gmail.com>> wrote:
Hi, All.

Apache Spark has been suffered from a known consistency issue on `CHAR` type 
behavior among its usages and configurations. However, the evolution direction 
has been gradually moving forward to be consistent inside Apache Spark because 
we don't have `CHAR` offically. The following is the summary.

With 1.6.x ~ 2.3.x, `STORED PARQUET` has the following different result.
(`spark.sql.hive.convertMetastoreParquet=false` provides a fallback to Hive 
behavior.)

spark-sql> CREATE TABLE t1(a CHAR(3));
spark-sql> CREATE TABLE t2(a CHAR(3)) STORED AS ORC;
spark-sql> CREATE TABLE t3(a CHAR(3)) STORED AS PARQUET;

spark-sql> INSERT INTO TABLE t1 SELECT 'a ';
spark-sql> INSERT INTO TABLE t2 SELECT 'a ';
spark-sql> INSERT INTO TABLE t3 SELECT 'a ';

spark-sql> SELECT a, length(a) FROM 

Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Ok. Can't think of why that would happen.

Am Di., 10. Sept. 2019 um 20:26 Uhr schrieb Dhrubajyoti Hati <
dhruba.w...@gmail.com>:

> As mentioned in the very first mail:
> * same cluster it is submitted.
> * from same machine they are submitted and also from same user
> * each of them has 128 executors and 2 cores per executor with 8Gigs of
> memory each and both of them are getting that while running
>
> to clarify more let me quote what I mentioned above. *These data is taken
> from Spark-UI when the jobs are almost finished in both.*
> "What i found is the  the quantile values for median for one ran with
> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins." which
> means per task time taken is much higher in spark-submit script than
> jupyter script. This is where I am really puzzled because they are the
> exact same code. why running them two different ways vary so much in the
> execution time.
>
>
>
>
> *Regards,Dhrubajyoti Hati.Mob No: 9886428028/9652029028*
>
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch  wrote:
>
>> Sounds like you have done your homework to properly compare .   I'm
>> guessing the answer to the following is yes .. but in any case:  are they
>> both running against the same spark cluster with the same configuration
>> parameters especially executor memory and number of workers?
>>
>> Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
>> dhruba.w...@gmail.com>:
>>
>>> No, i checked for that, hence written "brand new" jupyter notebook. Also
>>> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
>>> compressed base64 encoded text data from a hive table and decompressing and
>>> decoding in one of the udfs. Also the time compared is from Spark UI not
>>> how long the job actually takes after submission. Its just the running time
>>> i am comparing/mentioning.
>>>
>>> As mentioned earlier, all the spark conf params even match in two
>>> scripts and that's why i am puzzled what going on.
>>>
>>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <
>>> pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the juypter notebook
>>>> already is connected to a running spark context, while spark-submit needs
>>>> to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> what the code looks like mostly:
>>>>>
>>>>> def fn1(ip):
>>>>>some code...
>>>>> ...
>>>>>
>>>>> def fn2(row):
>>>>> ...
>>>>> some operations
>>>>> ...
>>>>> return row1
>>>>>
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>>>>> ~4500 partitions
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>> .drop("colz") \
>>>>> .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>> .filter(ddf.colp == 'some_value') \
>>>>> .rdd.map(lambda row: fn2(row)) \
>>>>> .toDF()
>>>>>
>>>>> print edf.count() // simple way for the performance test in both
>>>>> platforms
>>>>>
>>>>> Now when I run the same code in a brand new jupyter notebook it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and  compared from both the platforms and they
>>>>> are exact same. I even tried to run this script in a single cell of 
>>>>> jupyter
>>>>> notebook and still have the same performance. I need to understand if I am
>>>>> missing something in the spark-submit which is causing the issue.  I tried
>>>>> to minimise the script to reproduce the same error without much code.
>>>>>
>>>>> Both are run in client mode on a yarn based spark cluster. The
>>>>> machines from which both are executed are also the same and from same 
>>>>> user.
>>>>>
>>>>> What i found is the  the quantile values for median for one ran with
>>>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am 
>>>>> not
>>>>> able to figure out why this is happening.
>>>>>
>>>>> Any one faced this kind of issue before or know how to resolve this?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>>
>>>> *Patrick McCarthy  *
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>>
>>>


Re: script running in jupyter 6-7x faster than spark submit

2019-09-10 Thread Stephen Boesch
Sounds like you have done your homework to properly compare .   I'm
guessing the answer to the following is yes .. but in any case:  are they
both running against the same spark cluster with the same configuration
parameters especially executor memory and number of workers?

Am Di., 10. Sept. 2019 um 20:05 Uhr schrieb Dhrubajyoti Hati <
dhruba.w...@gmail.com>:

> No, i checked for that, hence written "brand new" jupyter notebook. Also
> the time taken by both are 30 mins and ~3hrs as i am reading a 500  gigs
> compressed base64 encoded text data from a hive table and decompressing and
> decoding in one of the udfs. Also the time compared is from Spark UI not
> how long the job actually takes after submission. Its just the running time
> i am comparing/mentioning.
>
> As mentioned earlier, all the spark conf params even match in two scripts
> and that's why i am puzzled what going on.
>
> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, 
> wrote:
>
>> It's not obvious from what you pasted, but perhaps the juypter notebook
>> already is connected to a running spark context, while spark-submit needs
>> to get a new spot in the (YARN?) queue.
>>
>> I would check the cluster job IDs for both to ensure you're getting new
>> cluster tasks for each.
>>
>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati 
>> wrote:
>>
>>> Hi,
>>>
>>> I am facing a weird behaviour while running a python script. Here is
>>> what the code looks like mostly:
>>>
>>> def fn1(ip):
>>>some code...
>>> ...
>>>
>>> def fn2(row):
>>> ...
>>> some operations
>>> ...
>>> return row1
>>>
>>>
>>> udf_fn1 = udf(fn1)
>>> cdf = spark.read.table("") //hive table is of size > 500 Gigs with
>>> ~4500 partitions
>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>> .drop("colz") \
>>> .withColumnRenamed("colz", "coly")
>>>
>>> edf = ddf \
>>> .filter(ddf.colp == 'some_value') \
>>> .rdd.map(lambda row: fn2(row)) \
>>> .toDF()
>>>
>>> print edf.count() // simple way for the performance test in both
>>> platforms
>>>
>>> Now when I run the same code in a brand new jupyter notebook it runs 6x
>>> faster than when I run this python script using spark-submit. The
>>> configurations are printed and  compared from both the platforms and they
>>> are exact same. I even tried to run this script in a single cell of jupyter
>>> notebook and still have the same performance. I need to understand if I am
>>> missing something in the spark-submit which is causing the issue.  I tried
>>> to minimise the script to reproduce the same error without much code.
>>>
>>> Both are run in client mode on a yarn based spark cluster. The machines
>>> from which both are executed are also the same and from same user.
>>>
>>> What i found is the  the quantile values for median for one ran with
>>> jupyter was 1.3 mins and one ran with spark-submit was ~8.5 mins.  I am not
>>> able to figure out why this is happening.
>>>
>>> Any one faced this kind of issue before or know how to resolve this?
>>>
>>> *Regards,*
>>> *Dhrub*
>>>
>>
>>
>> --
>>
>>
>> *Patrick McCarthy  *
>>
>> Senior Data Scientist, Machine Learning Engineering
>>
>> Dstillery
>>
>> 470 Park Ave South, 17th Floor, NYC 10016
>>
>


Spark on YARN with private Docker repositories/registries

2019-08-16 Thread Tak-Lon (Stephen) Wu
Hi guys,

Have anyone been using spark (spark-submit) with yarn mode which pull
images from a private Docker repositories/registries ??

how do you pass in the docker config.json which included the auth tokens ?
or is there any environment variable can be added in the system environment
to make it load from it by default?

Thanks,
Stephen


Re: Incremental (online) machine learning algorithms on ML

2019-08-05 Thread Stephen Boesch
There are several high bars to getting a new algorithm adopted.

*  It needs to be deemed by the MLLib committers/shepherds as widely useful
to the community.  Algorithms offered by larger companies after having
demonstrated usefulness at scale for   use cases  likely to be encountered
by many other companies stand a better chance
* There is quite limited bandwidth for consideration of new algorithms:
there has been a dearth of new ones accepted since early 2015 . So
prioritization is a challenge.
* The code must demonstrate high quality standards especially wrt
testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include
comparisons/ tradeoffs with state of the art described in relevant papers.
These questions will typically be asked during design/code reviews - i.e.
did you consider the approach shown *here *
* There is also luck and timing involved. The review process might start in
a given month A but reviewers become busy or higher priorities intervene ..
and then when will the reviewing continue..
* At the point that the above are complete then there are intricacies with
integrating with a particular Spark release

Am Mo., 5. Aug. 2019 um 05:58 Uhr schrieb chagas :

> Hi,
>
> After searching the machine learning library for streaming algorithms, I
> found two that fit the criteria: Streaming Linear Regression
> (
> https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
>
> and Streaming K-Means
> (
> https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means
> ).
>
> However, both use the RDD-based API MLlib instead of the DataFrame-based
> API ML; are there any plans for bringing them both to ML?
>
> Also, is there any technical reason why there are so few incremental
> algorithms on the machine learning library? There's only 1 algorithm for
> regression and clustering each, with nothing for classification,
> dimensionality reduction or feature extraction.
>
> If there is a reason, how were those two algorithms implemented? If
> there isn't, what is the general consensus on adding new online machine
> learning algorithms?
>
> Regards,
> Lucas Chagas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-20 Thread Stephen Boesch
Consider the following *intended* sql:

select row_number()
  over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,*
  from flights

This will *not* work in *structured streaming* : The culprit is:

 partition by Origin

The requirement is to use a timestamp-typed field such as

 partition by flightTime

Tathagata Das (core committer for *spark streaming*) - replies on that in a
nabble thread:

 The traditional SQL windows with `over` is not supported in streaming.
Only time-based windows, that is, `window("timestamp", "10 minutes")` is
supported in streaming

*W**hat then* for my query above - which *must* be based on the *Origin* field?
What is the closest equivalent to that query? Or what would be a workaround
or different approach to achieve same results?
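
One candidate workaround (a sketch only, assuming Spark 2.4+, a Java pipeline, 
and that ranking within each micro-batch is acceptable for the use case) is to 
drop into batch semantics per micro-batch with foreachBatch, where the ordinary 
over (partition by ...) window is allowed; error handling and awaitTermination 
are omitted, and the output path is a placeholder:

import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// "flights" is assumed to be the streaming Dataset<Row> from the query above.
flights.writeStream()
    .foreachBatch(new VoidFunction2<Dataset<Row>, Long>() {
        @Override
        public void call(Dataset<Row> batch, Long batchId) {
            batch.createOrReplaceTempView("flights_batch");
            batch.sparkSession().sql(
                "select row_number() over "
                + "(partition by Origin order by OnTimeDepPct desc) OnTimeDepRank, * "
                + "from flights_batch")
              .write()
              .mode("append")
              .parquet("/tmp/flights_ranked");   // placeholder sink
        }
    })
    .start();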


Re: spark-sklearn

2019-04-08 Thread Stephen Boesch
There are several suggestions on this SOF
https://stackoverflow.com/questions/38984775/spark-errorexpected-zero-arguments-for-construction-of-classdict-for-numpy-cor

1

You need to convert the final value to a python list. You implement the
function as follows:

def uniq_array(col_array):
x = np.unique(col_array)
return list(x)

This is because Spark doesn't understand the numpy array format. In order
to feed a python object that Spark DataFrames understand as an ArrayType,
you need to convert the output to a python list before returning it.




The source of the problem is that object returned from the UDF doesn't
conform to the declared type. np.unique not only returns numpy.ndarray but
also converts numerics to the corresponding NumPy types which are not
compatible  with
DataFrame API. You can try something like this:

udf(lambda x: list(set(x)), ArrayType(IntegerType()))

or this (to keep order)

udf(lambda xs: list(OrderedDict((x, None) for x in xs)),
ArrayType(IntegerType()))

instead.

If you really want np.unique you have to convert the output:

udf(lambda x: np.unique(x).tolist(), ArrayType(IntegerType()))













Am Mo., 8. Apr. 2019 um 11:43 Uhr schrieb Sudhir Babu Pothineni <
sbpothin...@gmail.com>:

>
>
>
> Trying to run tests in spark-sklearn, anybody check the below exception
>
> pip freeze:
>
> nose==1.3.7
> numpy==1.16.1
> pandas==0.19.2
> python-dateutil==2.7.5
> pytz==2018.9
> scikit-learn==0.19.2
> scipy==1.2.0
> six==1.12.0
> spark-sklearn==0.3.0
>
> Spark version:
> spark-2.2.3-bin-hadoop2.6/bin/pyspark
>
>
> running into following exception:
>
> ==
> ERROR: test_scipy_sparse (spark_sklearn.converter_test.CSRVectorUDTTests)
> --
> Traceback (most recent call last):
>   File
> "/home/spothineni/Downloads/spark-sklearn-release-0.3.0/python/spark_sklearn/converter_test.py",
> line 83, in test_scipy_sparse
> self.assertEqual(df.count(), 1)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/pyspark/sql/dataframe.py",
> line 522, in count
> return int(self._jdf.count())
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 1257, in __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/pyspark/sql/utils.py",
> line 63, in deco
> return f(*a, **kw)
>   File
> "/home/spothineni/Downloads/spark-2.4.1-bin-hadoop2.6/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
> line 328, in get_return_value
> format(target_id, ".", name), value)
> Py4JJavaError: An error occurred while calling o652.count.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 11 in stage 0.0 failed 1 times, most recent failure: Lost task 11.0 in
> stage 0.0 (TID 11, localhost, executor driver):
> net.razorvine.pickle.PickleException: expected zero arguments for
> construction of ClassDict (for numpy.dtype)
> at
> net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
> at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
> at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
> at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
> at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
> at
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:188)
> at
> org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:187)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.agg_doAggregateWithoutKey_0$(Unknown
> Source)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> at org.apache.spark.scheduler.Task.run(Task.scala:121)
> at
> 

Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
You might have better luck downloading the 2.4.X branch

Am Di., 12. März 2019 um 16:39 Uhr schrieb swastik mittal :

> Then are the mlib of spark compatible with scala 2.12? Or can I change the
> spark version from spark3.0 to 2.3 or 2.4 in local spark/master?
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Build spark source code with scala 2.11

2019-03-12 Thread Stephen Boesch
I think scala 2.11 support was removed with the spark3.0/master

Am Di., 12. März 2019 um 16:26 Uhr schrieb swastik mittal :

> I am trying to build my spark using build/sbt package, after changing the
> scala versions to 2.11 in pom.xml because my applications jar files use
> scala 2.11. But building the spark code gives an error in sql  saying "A
> method with a varargs annotation produces a forwarder method with the same
> signature (exprs:
> Array[org.apache.spark.sql.Column])org.apache.spark.sql.Column as an
> existing method." in UserDefinedFunction.scala. I even tried building with
> using Dscala parameter to change the version of scala but it gives the same
> error. How do I change the spark and scala version and build the spark
> source code correctly? Any help is appreciated.
>
> Thanks
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Classic logistic regression missing !!! (Generalized linear models)

2018-10-11 Thread Stephen Boesch
So the LogisticRegression with regParam and elasticNetParam set to 0 is not
what you are looking for?

https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#logistic-regression

  .setRegParam(0.0)
  .setElasticNetParam(0.0)


Am Do., 11. Okt. 2018 um 15:46 Uhr schrieb pikufolgado <
pikufolg...@gmail.com>:

> Hi,
>
> I would like to carry out a classic logistic regression analysis. In other
> words, without using penalised regression ("glmnet" in R). I have read the
> documentation and am not able to find this kind of models.
>
> Is it possible to estimate this? In R the name of the function is "glm".
>
> Best regards
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Fixing NullType for parquet files

2018-09-12 Thread Stephen Boesch
When this JIRA was opened in 2015 the parquet did not support null types.
I commented on this JIRA in May that - given parquet now does include that
support - can this bug be reopened ?  There was no response. What is the
correct way to request consideration of re-opening this issue?

https://issues.apache.org/jira/browse/SPARK-10943

Michael Armbrust added a comment - 15/Oct/15 18:00:

Yeah, parquet doesn't have a concept of null type. I'd probably suggest
they case null to a type CAST(NULL AS INT) if they really want to do this,
but really you should just omit the column probably.
Daniel Davis added a comment - 03/May/18 10:14:

According to parquet data types
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md, now a
Null type should be supported. So perhaps this issue should be reconsidered?
Stephen Boesch added a comment - 03/May/18 17:08:

Given the comment by Daniel Davis can this issue be reopened?
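
For reference, the CAST(NULL AS INT) workaround from the first comment looks 
roughly like this in the DataFrame API (a sketch; df, the column name and the 
output path are placeholders):

import static org.apache.spark.sql.functions.lit;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;

// Give the all-null column an explicit type instead of leaving it as NullType,
// so the parquet writer has a concrete type to record.
Dataset<Row> fixed = df.withColumn("maybe_missing",
    lit(null).cast(DataTypes.IntegerType));
fixed.write().parquet("/tmp/with_typed_null");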


Re: [announce] BeakerX supports Scala+Spark in Jupyter

2018-06-07 Thread Stephen Boesch
Assuming that the spark 2.X kernel (e.g. toree) were chosen for a given
jupyter notebook and there is a  Cell 3 that contains some Spark DataFrame
operations .. Then :


   - what is the relationship  does the %%spark  magic and the toree kernel?
   - how does the %%spark magic get applied to that other Cell 3 ?

thanks!

2018-06-07 16:33 GMT-07:00 s...@draves.org :

> We are pleased to announce release 0.19.0 of BeakerX ,
> a collection of extensions and kernels for Jupyter and Jupyter Lab.
>
> BeakerX now features Scala+Spark integration including GUI configuration,
> status, progress, interrupt, and interactive tables.
>
> We are very interested in your feedback about what remains to be done.
> You may reach by github and gitter, as documented in the readme:
> https://github.com/twosigma/beakerx
>
> Thanks, -Scott
>
>
> --
> BeakerX.com
> ScottDraves.com 
> @Scott_Draves 
>
>


Re: Guava dependency issue

2018-05-08 Thread Stephen Boesch
I downgraded to spark 2.0.1 and it fixed that *particular *runtime
exception: but then a similar one appears when saving to parquet:

An  SOF question on this was created a month ago and today further details plus
an open bounty were added to it:

https://stackoverflow.com/questions/49713485/spark-error-with-google-guava-library-java-lang-nosuchmethoderror-com-google-c

The new but similar exception is shown below:

The hack to downgrade to 2.0.1 does help - i.e. execution proceeds *further* :
but then when writing out to *parquet* the above error does happen.

8/05/07 11:26:11 ERROR Executor: Exception in task 0.0 in stage 2741.0
(TID 2618)
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at org.apache.hadoop.io.compress.CodecPool.<init>(CodecPool.java:74)
at 
org.apache.parquet.hadoop.CodecFactory$BytesCompressor.<init>(CodecFactory.java:92)
at 
org.apache.parquet.hadoop.CodecFactory.getCompressor(CodecFactory.java:169)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:303)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetFileFormat.scala:562)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:139)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.newOutputWriter(WriterContainer.scala:131)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:247)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:6



2018-05-07 10:30 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> I am intermittently running into guava dependency issues across mutiple
> spark projects.  I have tried maven shade / relocate but it does not
> resolve the issues.
>
> The current project is extremely simple: *no* additional dependencies
> beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
> clean was re-applied).
>
> Is there a reliable approach to handling the versioning for guava within
> spark dependency projects?
>
>
> [INFO] 
> 
> [INFO] Building ccapps_final 1.0-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
> 18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> [WARNING]
> java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.
> refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/
> google/common/cache/CacheBuilder;
> at org.apache.hadoop.security.Groups.<init>(Groups.java:96)
> at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
> at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(
> Groups.java:293)
> at org.apache.hadoop.security.UserGroupInformation.initialize(
> UserGroupInformation.java:283)
> at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(
> UserGroupInformation.java:260)
> at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(
> UserGroupInformation.java:789)
> at org.apache.hadoop.security.UserGroupInformation.getLoginUser(
> UserGroupInformation.java:774)
> at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(
> UserGroupInformation.java:647)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.
> apply(Utils.scala:2424)
> at org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.
> apply(Utils.scala:2424)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
> at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
&g

Guava dependency issue

2018-05-07 Thread Stephen Boesch
I am intermittently running into guava dependency issues across mutiple
spark projects.  I have tried maven shade / relocate but it does not
resolve the issues.

The current project is extremely simple: *no* additional dependencies
beyond scala, spark, and scalatest - yet the issues remain (and yes mvn
clean was re-applied).

Is there a reliable approach to handling the versioning for guava within
spark dependency projects?


[INFO]

[INFO] Building ccapps_final 1.0-SNAPSHOT
[INFO]

[INFO]
[INFO] --- exec-maven-plugin:1.6.0:java (default-cli) @ ccapps_final ---
18/05/07 10:24:00 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
[WARNING]
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.refreshAfterWrite(JLjava/util/concurrent/TimeUnit;)Lcom/google/common/cache/CacheBuilder;
at org.apache.hadoop.security.Groups.<init>(Groups.java:96)
at org.apache.hadoop.security.Groups.<init>(Groups.java:73)
at
org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:293)
at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:283)
at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:260)
at
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:789)
at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:774)
at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:647)
at
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at
org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2424)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2424)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:295)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at
org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at
org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at luisvictorsteve.SlackSentiment$.getSpark(SlackSentiment.scala:10)
at luisvictorsteve.SlackSentiment$.main(SlackSentiment.scala:16)
at luisvictorsteve.SlackSentiment.main(SlackSentiment.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
at java.lang.Thread.run(Thread.java:748)
[INFO]
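
Not an answer from the thread, but one hedged diagnostic that often helps with this
class of NoSuchMethodError is asking the JVM which jar actually supplied the
conflicting Guava class:

// prints which jar supplied Guava's CacheBuilder; if an old Guava (or a Hadoop jar
// bundling one) wins on the classpath, that is usually the source of NoSuchMethodError
val src = classOf[com.google.common.cache.CacheBuilder[_, _]]
  .getProtectionDomain.getCodeSource
println(if (src == null) "bootstrap classpath" else src.getLocation.toString)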



Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
While it is certainly possible to use VM I have seen in a number of places
warnings that collect() results must be able to be fit in memory. I'm not
sure if that applies to *all" spark calculations: but in the very least
each of the specific collect()'s that are performed would need to be
verified.

And maybe *all *collects do require sufficient memory - would you like to
check the source code to see if there were disk backed collects actually
happening for some cases?

2018-04-28 9:48 GMT-07:00 Deepak Goel <deic...@gmail.com>:

> There is something as *virtual memory*
>
> On Sat, 28 Apr 2018, 21:19 Stephen Boesch, <java...@gmail.com> wrote:
>
>> Do you have a machine with  terabytes of RAM?  afaik collect() requires
>> RAM - so that would be your limiting factor.
>>
>> 2018-04-28 8:41 GMT-07:00 klrmowse <klrmo...@gmail.com>:
>>
>>> i am currently trying to find a workaround for the Spark application i am
>>> working on so that it does not have to use .collect()
>>>
>>> but, for now, it is going to have to use .collect()
>>>
>>> what is the size limit (memory for the driver) of RDD file that
>>> .collect()
>>> can work with?
>>>
>>> i've been scouring google-search - S.O., blogs, etc, and everyone is
>>> cautioning about .collect(), but does not specify how huge is huge...
>>> are we
>>> talking about a few gigabytes? terabytes?? petabytes???
>>>
>>>
>>>
>>> thank you
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
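
Not raised in the thread, but for readers hitting the same wall: one common
alternative, sketched here with a placeholder rdd and handler, is to stream the
results to the driver instead of materializing everything at once with collect():

// toLocalIterator pulls one partition at a time to the driver, so the peak driver
// memory is roughly the largest partition rather than the whole dataset
rdd.toLocalIterator.foreach { record =>
  handle(record)   // handle() is a placeholder for whatever the driver needs to do
}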


Re: [Spark 2.x Core] .collect() size limit

2018-04-28 Thread Stephen Boesch
Do you have a machine with  terabytes of RAM?  afaik collect() requires RAM
- so that would be your limiting factor.

2018-04-28 8:41 GMT-07:00 klrmowse :

> i am currently trying to find a workaround for the Spark application i am
> working on so that it does not have to use .collect()
>
> but, for now, it is going to have to use .collect()
>
> what is the size limit (memory for the driver) of RDD file that .collect()
> can work with?
>
> i've been scouring google-search - S.O., blogs, etc, and everyone is
> cautioning about .collect(), but does not specify how huge is huge... are
> we
> talking about a few gigabytes? terabytes?? petabytes???
>
>
>
> thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: parquet vs orc files

2018-02-21 Thread Stephen Joung
In the case of parquet, the best source for me for configuring and ensuring "min/max
statistics" was

https://www.slideshare.net/mobile/RyanBlue3/parquet-performance-tuning-the-missing-guide

---

I don't have any experience in orc.

On Thu, Feb 22, 2018 at 6:59 AM, Kane Kim wrote:

> Thanks, how does min/max index work? Can spark itself configure bloom
> filters when saving as orc?
>
> On Wed, Feb 21, 2018 at 1:40 PM, Jörn Franke  wrote:
>
>> In the latest version both are equally well supported.
>>
>> You need to insert the data sorted on filtering columns
>> Then you will benefit from min max indexes and in case of orc additional
>> from bloom filters, if you configure them.
>> In any case I recommend also partitioning of files (do not confuse with
>> Spark partitioning ).
>>
>> What is best for you you have to figure out in a test. This highly
>> depends on the data and the analysis you want to do.
>>
>> > On 21. Feb 2018, at 21:54, Kane Kim  wrote:
>> >
>> > Hello,
>> >
>> > Which format is better supported in spark, parquet or orc?
>> > Will spark use internal sorting of parquet/orc files (and how to test
>> that)?
>> > Can spark save sorted parquet/orc files?
>> >
>> > Thanks!
>>
>
>
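
A small sketch of the "insert the data sorted on filtering columns" advice above;
the DataFrame, column names, and path are made up for illustration:

// sort within each output file on the columns you expect to filter by, so every
// parquet row group covers a narrow min/max range of those values and can be skipped
df.sortWithinPartitions("customer_id")
  .write
  .partitionBy("event_date")        // coarse directory-level pruning on the date
  .parquet("/data/events_sorted")   // hypothetical path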


Spark 2.2.1 EMR 5.11.1 Encrypted S3 bucket overwriting parquet file

2018-02-13 Thread Stephen Robinson
Hi All,


I am using the latest version of EMR to overwrite Parquet files to an S3 bucket 
encrypted with a KMS key. I am seeing the attached error whenever I Overwrite a 
parquet file. For example the below code produces the attached error and 
stacktrace:


List(1,2,3).toDF().write.mode("Overwrite").parquet("s3://some-encrypted-bucket/some-object")
List(1,2,3,4).toDF().write.mode("Overwrite").parquet("s3://some-encrypted-bucket/some-object")

The first call succeeds but the second fails.

If I change the s3:// part to the s3a:// protocol I do not see the error. I 
believe this to be an EMR error but am mentioning it here just in case anyone else 
has seen this or if it might be a Spark bug.

Thanks,

Steve



Stephen Robinson

steve.robin...@aquilainsight.com
+441312902300

www.aquilainsight.com
linkedin.com/aquilainsight
twitter.com/aquilainsight


This email and any attachments transmitted with it are intended for use by the 
intended recipient(s) only. If you have received this email in error, please 
notify the sender immediately and then delete it. If you are not the intended 
recipient, you must not keep, use, disclose, copy or distribute this email 
without the author's prior permission. We take precautions to minimize the risk 
of transmitting software viruses, but we advise you to perform your own virus 
checks on any attachment to this message. We cannot accept liability for any 
loss or damage caused by software viruses. The information contained in this 
communication may be confidential and may be subject to the attorney-client 
privilege.
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; 
Request ID: ???)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1588)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1258)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1030)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:742)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:716)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4169)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4116)
  at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.putObject(AmazonS3Client.java:1700)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.call.PutObjectCall.performCall(PutObjectCall.java:34)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.call.PutObjectCall.performCall(PutObjectCall.java:9)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.call.AbstractUploadingS3Call.perform(AbstractUploadingS3Call.java:62)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.executor.GlobalS3Executor.execute(GlobalS3Executor.java:80)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.invoke(AmazonS3LiteClient.java:176)
  at 
com.amazon.ws.emr.hadoop.fs.s3.lite.AmazonS3LiteClient.putObject(AmazonS3LiteClient.java:104)
  at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.storeEmptyFile(Jets3tNativeFileSystemStore.java:199)
  at 

Re: write parquet with statistics min max with binary field

2018-01-28 Thread Stephen Joung
After setting `parquet.strings.signed-min-max.enabled` to `true` in
`ShowMetaCommand.java`, parquet-tools meta shows the min/max.


@@ -57,8 +57,9 @@ public class ShowMetaCommand extends ArgsOnlyCommand {

 String[] args = options.getArgs();
 String input = args[0];

 Configuration conf = new Configuration();
+conf.set("parquet.strings.signed-min-max.enabled", "true");
 Path inputPath = new Path(input);
 FileStatus inputFileStatus =
inputPath.getFileSystem(conf).getFileStatus(inputPath);
 List footers = ParquetFileReader.readFooters(conf,
inputFileStatus, false);


Result

row group 1: RC:3 TS:56 OFFSET:4


field1:   BINARY SNAPPY DO:0 FPO:4 SZ:56/56/1.00 VC:3
ENC:DELTA_BYTE_ARRAY -- ST:[min: a, max: c, num_nulls: 0]


For the reference, this was intended symptom by PARQUET-686 [1].


[1] https://www.mail-archive.com/commits@parquet.apache.org/msg00491.html
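
For anyone wanting the same statistics to be usable from Spark itself, here is a
hedged sketch (not from the thread) of setting the flag on the read side; treat it
as an assumption to verify against your Spark and Parquet versions:

import spark.implicits._

// propagate the same parquet-mr flag through Spark's Hadoop configuration before
// reading; whether Spark's reader actually uses binary min/max statistics for
// row-group skipping this way depends on the Spark and Parquet versions in use
spark.sparkContext.hadoopConfiguration
  .set("parquet.strings.signed-min-max.enabled", "true")

val df = spark.read.parquet("f1")
df.filter($"field1" === "b").explain()   // inspect whether the filter is pushed down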

2018-01-24 10:31 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>:

> How can I write parquet file with min/max statistic?
>
> 2018-01-24 10:30 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>:
>
>> Hi, I am trying to use spark sql filter push down. and specially want to
>> use row group skipping with parquet file.
>>
>> And I guessed that I need parquet file with statistics min/max.
>>
>> 
>>
>> On spark master branch - I tried to write single column with "a", "b",
>> "c" to parquet file f1
>>
>>scala> List("a", "b", "c").toDF("field1").coalesce(1
>> ).write.parquet("f1")
>>
>> But saved file does not have statistics (min, max)
>>
>>$ ls f1/*.parquet
>>f1/part-0-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
>>$ parquet-tool meta  f1/*.parquet
>>file:file:/Users/stephen/p/spark/f
>> 1/part-0-445036f9-7a40-4333-8405-8451faa44319- c000.snappy.parquet
>>creator: parquet-mr version 1.8.2 (build
>> c6522788629e590a53eb79874b95f6c3ff11f16c)
>>extra:   org.apache.spark.sql.parquet.row.metadata =
>> {"type":"struct","fields":[{"name":"field1","type":"string",
>> "nullable":true,"metadata":{}}]}
>>
>>file schema: spark_schema
>>---
>> -
>>field1:  OPTIONAL BINARY O:UTF8 R:0 D:1
>>
>>row group 1: RC:3 TS:48 OFFSET:4
>>---
>> -
>>field1:   BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3
>> ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
>>
>> 
>>
>> Any pointer or comment would be appreciated.
>> Thank you.
>>
>>
>


Re: write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
How can I write parquet file with min/max statistic?

2018-01-24 10:30 GMT+09:00 Stephen Joung <step...@vcnc.co.kr>:

> Hi, I am trying to use spark sql filter push down. and specially want to
> use row group skipping with parquet file.
>
> And I guessed that I need parquet file with statistics min/max.
>
> 
>
> On spark master branch - I tried to write single column with "a", "b", "c"
> to parquet file f1
>
>scala> List("a", "b", "c").toDF("field1").coalesce(
> 1).write.parquet("f1")
>
> But saved file does not have statistics (min, max)
>
>$ ls f1/*.parquet
>f1/part-00000-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
>$ parquet-tool meta  f1/*.parquet
>file:
> file:/Users/stephen/p/spark/f1/part-0-445036f9-7a40-4333-8405-8451faa44319-
> c000.snappy.parquet
>creator: parquet-mr version 1.8.2 (build
> c6522788629e590a53eb79874b95f6c3ff11f16c)
>extra:   org.apache.spark.sql.parquet.row.metadata =
> {"type":"struct","fields":[{"name":"field1","type":"string"
> ,"nullable":true,"metadata":{}}]}
>
>file schema: spark_schema
>---
> -
>field1:  OPTIONAL BINARY O:UTF8 R:0 D:1
>
>row group 1: RC:3 TS:48 OFFSET:4
>---
> -
>field1:   BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3
> ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]
>
> 
>
> Any pointer or comment would be appreciated.
> Thank you.
>
>


write parquet with statistics min max with binary field

2018-01-23 Thread Stephen Joung
Hi, I am trying to use spark sql filter push down, and specifically want to
use row group skipping with parquet files.

And I guessed that I need parquet file with statistics min/max.



On spark master branch - I tried to write single column with "a", "b", "c"
to parquet file f1

   scala> List("a", "b", "c").toDF("field1").coalesce(1).write.parquet("f1")

But saved file does not have statistics (min, max)

   $ ls f1/*.parquet
   f1/part-0-445036f9-7a40-4333-8405-8451faa44319-c000.snappy.parquet
   $ parquet-tool meta  f1/*.parquet
   file:
 file:/Users/stephen/p/spark/f1/part-0-445036f9-7a40-4333-8405-8451faa44319-
c000.snappy.parquet
   creator: parquet-mr version 1.8.2 (build
c6522788629e590a53eb79874b95f6c3ff11f16c)
   extra:   org.apache.spark.sql.parquet.row.metadata =
{"type":"struct","fields":[{"name":"field1","type":"string","nullable":true,"metadata":{}}]}

   file schema: spark_schema

 

   field1:  OPTIONAL BINARY O:UTF8 R:0 D:1

   row group 1: RC:3 TS:48 OFFSET:4

 

   field1:   BINARY SNAPPY DO:0 FPO:4 SZ:50/48/0.96 VC:3
ENC:BIT_PACKED,RLE,PLAIN ST:[no stats for this column]



Any pointer or comment would be appreciated.
Thank you.


Has there been any explanation on the performance degradation between spark.ml and Mllib?

2018-01-21 Thread Stephen Boesch
While MLlib performed favorably vs Flink it *also* performed favorably vs
spark.ml - and by an *order of magnitude*. The following is one of the
tables - it is for Logistic Regression. At that time spark.ml did not yet
support SVM.

From: https://bdataanalytics.biomedcentral.com/articles/10.1186/s41044-016-0020-2



Table 3: LR learning time in seconds

Dataset        Spark MLlib   Spark ML   Flink
ECBDL14-10     3             26         181
ECBDL14-30     5             63         815
ECBDL14-50     6             173        1314
ECBDL14-75     8             260        1878
ECBDL14-100    12            415        2566

The DataFrame-based API (spark.ml) is even slower relative to the RDD-based
API (mllib) than had been anticipated - yet the latter has been shut down for
several versions of Spark already. What is the thought process behind that
decision: *performance matters!* Is there visibility into a meaningful
narrowing of that gap?


Re: Anyone know where to find independent contractors in New York?

2017-12-21 Thread Stephen Boesch
Hi Richard, this is not a jobs board: please only discuss spark application
development issues.

2017-12-21 8:34 GMT-08:00 Richard L. Burton III :

> I'm trying to locate four independent contractors who have experience with
> Spark. I'm not sure where I can go to find experienced Spark consultants.
>
> Please, no recruiters.
> --
> -Richard L. Burton III
>
>


Re: LDA and evaluating topic number

2017-12-07 Thread Stephen Boesch
I have been testing on the 20 NewsGroups dataset - which the Spark docs
themselves reference.  I can confirm that perplexity increases and
likelihood decreases as topics increase - and am similarly confused by
these results.

2017-09-28 10:50 GMT-07:00 Cody Buntain :

> Hi, all!
>
> Is there an example somewhere on using LDA’s logPerplexity()/logLikelihood()
> functions to evaluate topic counts? The existing MLLib LDA examples show
> calling them, but I can’t find any documentation about how to interpret the
> outputs. Graphing the outputs for logs of perplexity and likelihood aren’t
> consistent with what I expected (perplexity increases and likelihood
> decreases as topics increase, which seem odd to me).
>
> An example of what I’m doing is here: http://www.cs.umd.edu/~
> cbuntain/FindTopicK-pyspark-regex.html
>
> Thanks very much in advance! If I can figure this out, I can post example
> code online, so others can see how this process is done.
>
> -Best regards,
> Cody
> _
> Cody Buntain, PhD
> Postdoc, @UMD_CS
> Intelligence Community Postdoctoral Fellow
> cbunt...@cs.umd.edu
> www.cs.umd.edu/~cbuntain
>
>


Weight column values not used in Binary Logistic Regression Summary

2017-11-18 Thread Stephen Boesch
In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a
number of comments identical to the following:

* @note This ignores instance weights (setting all to 1.0) from
`LogisticRegression.weightCol`.
* This will change in later Spark versions.


Are there any plans to address this? Our team is using instance weights
with sklearn LogisticRegression - and this limitation will complicate a
potential migration.


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1543


Re: Spark streaming for CEP

2017-10-24 Thread Stephen Boesch
Hi Mich, the github link has a brief intro - including a link to the formal
docs http://logisland.readthedocs.io/en/latest/index.html .   They have an
architectural overview, developer guide, tutorial, and pretty comprehensive
api docs.

2017-10-24 13:31 GMT-07:00 Mich Talebzadeh :

> thanks Thomas.
>
> do you have a summary write-up for this tool please?
>
>
> regards,
>
>
>
>
> Thomas
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 24 October 2017 at 13:53, Thomas Bailet 
> wrote:
>
>> Hi
>>
>> we (@ hurence) have released on open source middleware based on
>> SparkStreaming over Kafka to do CEP and log mining, called *logisland*  (
>> https://github.com/Hurence/logisland/) it has been deployed into
>> production for 2 years now and does a great job. You should have a look.
>>
>>
>> bye
>>
>> Thomas Bailet
>>
>> CTO : hurence
>>
>> Le 18/10/17 à 22:05, Mich Talebzadeh a écrit :
>>
>> As you may be aware the granularity that Spark streaming has is
>> micro-batching and that is limited to 0.5 second. So if you have continuous
>> ingestion of data then Spark streaming may not be granular enough for CEP.
>> You may consider other products.
>>
>> Worth looking at this old thread on mine "Spark support for Complex Event
>> Processing (CEP)
>>
>> https://mail-archives.apache.org/mod_mbox/spark-user/201604.
>> mbox/%3CCAJ3fcbB8eaf0JV84bA7XGUK5GajC1yGT3ZgTNCi8arJg56=LbQ@
>> mail.gmail.com%3E
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 18 October 2017 at 20:52, anna stax  wrote:
>>
>>> Hello all,
>>>
>>> Has anyone used spark streaming for CEP (Complex Event processing).  Any
>>> CEP libraries that works well with spark. I have a use case for CEP and
>>> trying to see if spark streaming is a good fit.
>>>
>>> Currently we have a data pipeline using Kafka, Spark streaming and
>>> Cassandra for data ingestion and near real time dashboard.
>>>
>>> Please share your experience.
>>> Thanks much.
>>> -Anna
>>>
>>>
>>>
>>
>>
>


Re: Is there a difference between df.cache() vs df.rdd.cache()

2017-10-13 Thread Stephen Boesch
@Vadim   Would it be true to say the `.rdd` *may* be creating a new job -
depending on whether the DataFrame/DataSet had already been materialized
via an action or checkpoint?   If the only prior operations on the
DataFrame had been transformations then the dataframe would still not have
been calculated.  In that case would it also be true that a subsequent
action/checkpoint on the DataFrame (not the rdd) would then generate a
separate job?

2017-10-13 14:50 GMT-07:00 Vadim Semenov :

> When you do `Dataset.rdd` you actually create a new job
>
> here you can see what it does internally:
> https://github.com/apache/spark/blob/master/sql/core/src/
> main/scala/org/apache/spark/sql/Dataset.scala#L2816-L2828
>
>
>
> On Fri, Oct 13, 2017 at 5:24 PM, Supun Nakandala <
> supun.nakand...@gmail.com> wrote:
>
>> Hi Weichen,
>>
>> Thank you for the reply.
>>
>> My understanding was Dataframe API is using the old RDD implementation
>> under the covers though it presents a different API. And calling
>> df.rdd will simply give access to the underlying RDD. Is this assumption
>> wrong? I would appreciate if you can shed more insights on this issue or
>> point me to documentation where I can learn them.
>>
>> Thank you in advance.
>>
>> On Fri, Oct 13, 2017 at 3:19 AM, Weichen Xu 
>> wrote:
>>
>>> You should use `df.cache()`
>>> `df.rdd.cache()` won't work, because `df.rdd` generate a new RDD from
>>> the original `df`. and then cache the new RDD.
>>>
>>> On Fri, Oct 13, 2017 at 3:35 PM, Supun Nakandala <
>>> supun.nakand...@gmail.com> wrote:
>>>
 Hi all,

 I have been experimenting with cache/persist/unpersist methods with
 respect to both Dataframes and RDD APIs. However, I am experiencing
 different behaviors Ddataframe API compared RDD API such Dataframes are not
 getting cached when count() is called.

 Is there a difference between how these operations act wrt to Dataframe
 and RDD APIs?

 Thank You.
 -Supun

>>>
>>>
>>
>
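
A toy sketch (nothing from the original code) of the distinction being discussed:

val df = spark.range(0, 1000000).toDF("id")

df.rdd.cache()   // caches a *new* RDD produced by df.rdd; df's own plan is untouched
df.count()       // still recomputes from the source - the RDD cached above is not reused

df.cache()       // marks the DataFrame's plan output for caching
df.count()       // the first action materializes the cache
df.count()       // subsequent actions read from the cache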


Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
Thinking more carefully on your comment:

   - There may be some ambiguity as to whether the repo provided libraries
   are actually being used here - as you indicate - instead of the in-project
   classes. That would have to do with how the classpath inside IJ was
   constructed. When I click through any of the spark classes in the IDE
   they do go directly to the in-project versions and not the repo jars: but
   that may not be definitive
   - In any case I had already performed the maven install and just
   verified that the jar's do have the correct timestamps in the local maven
   repo
   - The local maven repo is included by default  - so should not need to
   do anything special there

The same errors from the original post continue to occur.


2017-10-11 20:05 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> A clarification here: the example is being run *from the Spark codebase*.
> Therefore the mvn install step would not be required as the classes are
> available directly within the project.
>
> The reason for needing the `mvn package` to be invoked is to pick up the
> changes of having updated the spark dependency scopes from *provided *to
> *compile*.
>
> As mentioned the spark unit tests are working (and within Intellij and
> without `mvn install`): only the examples are not.
>
> 2017-10-11 19:43 GMT-07:00 Paul <corley.paul1...@gmail.com>:
>
>> You say you did the maven package but did you do a maven install and
>> define your local maven repo in SBT?
>>
>> -Paul
>>
>> Sent from my iPhone
>>
>> On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote:
>>
>> When attempting to run any example program w/ Intellij I am running into
>> guava versioning issues:
>>
>> Exception in thread "main" java.lang.NoClassDefFoundError:
>> com/google/common/cache/CacheLoader
>> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkCon
>> f.scala:73)
>> at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
>> at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(S
>> parkSession.scala:919)
>> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(S
>> parkSession.scala:918)
>> at scala.Option.getOrElse(Option.scala:121)
>> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkS
>> ession.scala:918)
>> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExamp
>> le.scala:40)
>> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
>> Caused by: java.lang.ClassNotFoundException:
>> com.google.common.cache.CacheLoader
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>> ... 9 more
>>
>> The *scope*s for the spark dependencies were already changed from
>> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
>> already been run (successfully) from command line - and the (mvn) project
>> completely rebuilt inside intellij.
>>
>> The spark testcases run fine: this is a problem only in the examples
>> module.  Anyone running these successfully in IJ?  I have tried for
>> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>>
>>
>>
>


Re: Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
A clarification here: the example is being run *from the Spark codebase*.
Therefore the mvn install step would not be required as the classes are
available directly within the project.

The reason for needing the `mvn package` to be invoked is to pick up the
changes of having updated the spark dependency scopes from *provided *to
*compile*.

As mentioned the spark unit tests are working (and within Intellij and
without `mvn install`): only the examples are not.

2017-10-11 19:43 GMT-07:00 Paul <corley.paul1...@gmail.com>:

> You say you did the maven package but did you do a maven install and
> define your local maven repo in SBT?
>
> -Paul
>
> Sent from my iPhone
>
> On Oct 11, 2017, at 5:48 PM, Stephen Boesch <java...@gmail.com> wrote:
>
> When attempting to run any example program w/ Intellij I am running into
> guava versioning issues:
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> com/google/common/cache/CacheLoader
> at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
> at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(
> SparkSession.scala:919)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(
> SparkSession.scala:918)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(
> SparkSession.scala:918)
> at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExamp
> le.scala:40)
> at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
> Caused by: java.lang.ClassNotFoundException:
> com.google.common.cache.CacheLoader
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 9 more
>
> The *scope*s for the spark dependencies were already changed from
> *provided* to *compile* .  Both `sbt assembly` and `mvn package` had
> already been run (successfully) from command line - and the (mvn) project
> completely rebuilt inside intellij.
>
> The spark testcases run fine: this is a problem only in the examples
> module.  Anyone running these successfully in IJ?  I have tried for
> 2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.
>
>
>


Running spark examples in Intellij

2017-10-11 Thread Stephen Boesch
When attempting to run any example program w/ Intellij I am running into
guava versioning issues:

Exception in thread "main" java.lang.NoClassDefFoundError:
com/google/common/cache/CacheLoader
at org.apache.spark.SparkConf.loadFromSystemProperties(SparkConf.scala:73)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:68)
at org.apache.spark.SparkConf.<init>(SparkConf.scala:55)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$
7.apply(SparkSession.scala:919)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$
7.apply(SparkSession.scala:918)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
scala:918)
at org.apache.spark.examples.ml.KMeansExample$.main(KMeansExample.scala:40)
at org.apache.spark.examples.ml.KMeansExample.main(KMeansExample.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.cache.
CacheLoader
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 9 more

The *scope*s for the spark dependencies were already changed from
*provided* to *compile* .  Both `sbt assembly` and `mvn package` had
already been run (successfully) from command line - and the (mvn) project
completely rebuilt inside intellij.

The spark testcases run fine: this is a problem only in the examples
module.  Anyone running these successfully in IJ?  I have tried for
2.1.0-SNAPSHOT and 2.3.0-SNAPSHOT - with the same outcome.


Re: SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
The correct link is
https://docs.databricks.com/spark/latest/spark-sql/index.html .

This link does have the core syntax such as the BNF for the DDL, DML, and
SELECT. It does *not* have a reference for date / string / numeric
functions: is there any such reference at this point? It is not sufficient
to peruse the DSL list of functions since the usage (and sometimes the names
as well) differs from the DSL.

thanks
stephenb

2017-08-10 14:49 GMT-07:00 Jules Damji <dmat...@comcast.net>:

> I refer to docs.databricks.com/Spark/latest/Spark-sql/index.html.
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> > On Aug 10, 2017, at 1:46 PM, Stephen Boesch <java...@gmail.com> wrote:
> >
> >
> > While the DataFrame/DataSets are useful in many circumstances they are
> cumbersome for many types of complex sql queries.
> >
> > Is there an up to date *SQL* reference - i.e. not DataFrame DSL
> operations - for version 2.2?
> >
> > An example of what is not clear:  what constructs are supported within
> >
> > select count( predicate) from some_table
> >
> > when using spark sql.
> >
> > But in general the reference guide and programming guide for SQL seems
> to be difficult to locate - seemingly in favor of the DataFrame/DataSets.
>
>


SQL specific documentation for recent Spark releases

2017-08-10 Thread Stephen Boesch
While the DataFrame/DataSets are useful in many circumstances they are
cumbersome for many types of complex sql queries.

Is there an up to date *SQL* reference - i.e. not DataFrame DSL operations
- for version 2.2?

An example of what is not clear:  what constructs are supported within

select count( predicate) from some_table

when using spark sql.

But in general the reference guide and programming guide for SQL seems to
be difficult to locate - seemingly in favor of the DataFrame/DataSets.
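
For the count-with-a-predicate example, one construct that does work in plain
Spark SQL is a conditional aggregate; the table and column names below are made up:

// count(expr) counts non-null values, so wrapping the predicate in
// CASE WHEN ... THEN 1 END counts only the rows where the predicate holds
val counts = spark.sql(
  """SELECT count(CASE WHEN amount > 100 THEN 1 END) AS big_orders,
    |       count(*)                                 AS all_orders
    |  FROM some_table""".stripMargin)
counts.show()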


custom joins on dataframe

2017-07-22 Thread Stephen Fletcher
Normally a family of joins (left, right outer, inner) are performed on two
dataframes using columns for the comparison, i.e. left("acol") ===
right("acol"). The comparison operator of the "left" dataframe does
something internally and produces a column that I assume is used by the
join.

What I want is to create my own comparison operation (i have a case where i
want to use some fuzzy matching between rows and if they fall within some
threshold we allow the join to happen)

so it would look something like

left.join(right, my_fuzzy_udf (left("cola"),right("cola")))

Where my_fuzzy_udf is my defined UDF. My main concern is the column that
would have to be output: what would its value be, i.e. what would the function
need to return that the udf subsystem would then turn into a column to be
evaluated by the join?


Thanks in advance for any advice
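
A hedged sketch of one way this already works: the UDF returns a Boolean and is
used directly as the join condition, so no extra output column is needed. The
dataframes left/right, the column cola, and the similarity function are all
assumptions for illustration:

import org.apache.spark.sql.functions.udf

// placeholder similarity measure - substitute a real fuzzy-matching implementation
def similarity(a: String, b: String): Double =
  if (a == null || b == null || a.isEmpty || b.isEmpty) 0.0
  else a.intersect(b).length.toDouble / math.max(a.length, b.length)

val threshold = 0.8
val myFuzzyUdf = udf((a: String, b: String) => similarity(a, b) >= threshold)

// the join condition just needs to evaluate to a Boolean column; note this is
// effectively a cross join plus a filter, which can be expensive, and some Spark
// versions require spark.sql.crossJoin.enabled=true for a non-equi join like this
val joined = left.join(right, myFuzzyUdf(left("cola"), right("cola")))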


Re: What is the equivalent of mapPartitions in SpqrkSQL?

2017-06-25 Thread Stephen Boesch
Spark SQL did not support explicit partitioners even before tungsten: and
often enough this did hurt performance.  Even now Tungsten will not do the
best job every time: so the question from the OP is still germane.

2017-06-25 19:18 GMT-07:00 Ryan :

> Why would you like to do so? I think there's no need for us to explicitly
> ask for a forEachPartition in spark sql because tungsten is smart enough to
> figure out whether a sql operation could be applied on each partition or
> there has to be a shuffle.
>
> On Sun, Jun 25, 2017 at 11:32 PM, jeff saremi 
> wrote:
>
>> You can do a map() using a select and functions/UDFs. But how do you
>> process a partition using SQL?
>>
>>
>>
>
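
For readers who land here: the typed Dataset API does expose a partition-wise
operation even though the SQL dialect itself does not. A minimal sketch with an
assumed schema and path:

import spark.implicits._

case class Order(id: Long, amount: Double)   // assumed schema, for illustration only

val ds = spark.read.parquet("/data/orders").as[Order]   // hypothetical path

// mapPartitions hands you an Iterator per partition, which is useful for amortizing
// per-partition setup (connections, parsers, ...) across many rows
val formatted = ds.mapPartitions { rows =>
  val fmt = new java.text.DecimalFormat("#.##")   // example of per-partition setup
  rows.map(o => (o.id, fmt.format(o.amount)))
}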


Re: Using SparkContext in Executors

2017-05-28 Thread Stephen Boesch
You would need to use *native* Cassandra APIs in each Executor -
not org.apache.spark.sql.cassandra.CassandraSQLContext
- including creating a separate Cassandra connection on each Executor.

2017-05-28 15:47 GMT-07:00 Abdulfattah Safa :

> So I can't run SQL queries in Executors ?
>
> On Sun, May 28, 2017 at 11:00 PM Mark Hamstra 
> wrote:
>
>> You can't do that. SparkContext and SparkSession can exist only on the
>> Driver.
>>
>> On Sun, May 28, 2017 at 6:56 AM, Abdulfattah Safa 
>> wrote:
>>
>>> How can I use SparkContext (to create Spark Session or Cassandra
>>> Sessions) in executors?
>>> If I pass it as parameter to the foreach or foreachpartition, then it
>>> will have a null value.
>>> Shall I create a new SparkContext in each executor?
>>>
>>> Here is what I'm trying to do:
>>> Read a dump directory with millions of dump files as follows:
>>>
>>> dumpFiles = Directory.listFiles(dumpDirectory)
>>> dumpFilesRDD = sparkContext.parallize(dumpFiles, numOfSlices)
>>> dumpFilesRDD.foreachPartition(dumpFilePath->parse(dumpFilePath))
>>> .
>>> .
>>> .
>>>
>>> In parse(), each dump file is parsed and inserted into database using
>>> SparlSQL. In order to do that, SparkContext is needed in the function parse
>>> to use the sql() method.
>>>
>>
>>
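
A rough sketch of the pattern described above - parallelize on the driver, use the
native driver on the executors. It assumes the DataStax Java driver 3.x is on the
executor classpath; the host, keyspace, table, and the shape of parse()'s result
are all made up:

import com.datastax.driver.core.Cluster

dumpFilesRDD.foreachPartition { paths =>
  // one native connection per partition, built on the executor; never ship the
  // SparkContext/SparkSession into this closure
  val cluster = Cluster.builder().addContactPoint("cassandra-host").build()  // hypothetical host
  val session = cluster.connect("my_keyspace")                               // hypothetical keyspace
  try {
    paths.foreach { path =>
      val row = parse(path)   // the OP's parsing logic, assumed to return the values to store
      session.execute("INSERT INTO my_table (id, payload) VALUES (?, ?)", row.id, row.payload)
    }
  } finally {
    session.close()
    cluster.close()
  }
}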


Re: Jupyter spark Scala notebooks

2017-05-17 Thread Stephen Boesch
Jupyter with toree works well for my team. Jupyter is much more refined vs
zeppelin as far as notebook features and usability: shortcuts, editing, etc.
The caveat is it is better to run a separate server instance for
python/pyspark vs scala/spark.

2017-05-17 19:27 GMT-07:00 Richard Moorhead :

> Take a look at Apache Zeppelin; it has both python and scala interpreters.
> https://zeppelin.apache.org/
> Apache Zeppelin 
> zeppelin.apache.org
> Apache Zeppelin. A web-based notebook that enables interactive data
> analytics. You can make beautiful data-driven, interactive and
> collaborative documents with SQL ...
>
>
>
>
> . . . . . . . . . . . . . . . . . . . . . . . . . . .
>
> Richard Moorhead
> Software Engineer
> richard.moorh...@c2fo.com 
>
> C2FO: The World's Market for Working Capital®
>
>
> 
>     
> 
> 
> 
>
> The information contained in this message and any attachment may be
> privileged, confidential, and protected from disclosure. If you are not the
> intended recipient, or an employee, or agent responsible for delivering
> this message to the intended recipient, you are hereby notified that any
> dissemination, distribution, or copying of this communication is strictly
> prohibited. If you have received this communication in error, please notify
> us immediately by replying to the message and deleting from your computer.
>
> --
> *From:* upendra 1991 
> *Sent:* Wednesday, May 17, 2017 9:22:14 PM
> *To:* user@spark.apache.org
> *Subject:* Jupyter spark Scala notebooks
>
> What's the best way to use jupyter with Scala spark. I tried Apache toree
> and created a kernel but did not get it working. I believe there is a
> better way.
>
> Please suggest any best practices.
>
> Sent from Yahoo Mail on Android
> 
>


KTable like functionality in structured streaming

2017-05-16 Thread Stephen Fletcher
Are there any plans to add Kafka Streams KTable-like functionality in
structured streaming for kafka sources? Allowing querying of keyed messages
using spark sql, maybe calling KTables in the backend?


Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming,

Jacek also has a really good online spark book for spark 2, "mastering
spark". I found it very helpful when trying to understand spark 2's
encoders.

his book is here:
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details


On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian 
wrote:

> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>
>> I would suggest do not buy any book, just start with databricks community
>> edition
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>
>>> Well that is the nature of technology, ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>>
 I'm trying to decide whether to buy the book learning spark, spark for
 machine learning etc. or wait for a new edition covering the new concepts
 like dataframe and datasets. Anyone got any suggestions?

>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


Contributed to spark

2017-04-07 Thread Stephen Fletcher
I'd like to eventually contribute to spark, but I'm noticing that since spark 2
the query planner is heavily used throughout the Dataset code base. Are there
any sites I can go to that explain the technical details, more than just
from a high-level perspective?


reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to dataframes? Since switching over to
spark 2 I find myself increasingly dissatisfied with the idea of converting
dataframes to RDDs to do procedural programming on grouped data (both from an
ease-of-programming stance and a performance stance). So I've been using
Dataframe's experimental groupByKey and flatMapGroups, which perform
extremely well - I'm guessing because of the encoders - but the amount of
data being transferred is a little excessive. Are there any plans to port
reduceByKey (and additionally a reduceByKey left and right)?
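
Until something like that lands, a hedged sketch of getting reduceByKey-like
behavior while staying in the Dataset API; the schema and path are assumptions:

import spark.implicits._

case class Sale(storeId: Long, amount: Double)   // assumed schema, for illustration

val sales = spark.read.parquet("/data/sales").as[Sale]   // hypothetical source

// groupByKey + reduceGroups is backed by an Aggregator, so in recent Spark versions
// it can combine values before the shuffle rather than moving whole groups the way
// flatMapGroups does
val totals = sales
  .groupByKey(_.storeId)
  .reduceGroups((a, b) => Sale(a.storeId, a.amount + b.amount))   // Dataset[(Long, Sale)]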


Re: attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
sorry here's the whole code

val source =
spark.read.format("parquet").load("/emrdata/sources/very_large_ds")

implicit val mapEncoder =
org.apache.spark.sql.Encoders.kryo[(Any,ArrayBuffer[Row])]

source.map{ row => {
  val key = row(0)
  val buff = new ArrayBuffer[Row]()
  buff += row
  (key,buff)
   }
}

...

On Sun, Feb 26, 2017 at 7:31 AM, Stephen Fletcher <
stephen.fletc...@gmail.com> wrote:

> I'm attempting to perform a map on a Dataset[Row] but getting an error on
> decode when attempting to pass a custom encoder.
>  My code looks similar to the following:
>
>
> val source = spark.read.format("parquet").load("/emrdata/sources/very_
> large_ds")
>
>
>
> source.map{ row => {
>   val key = row(0)
>
>}
> }
>


attempting to map Dataset[Row]

2017-02-26 Thread Stephen Fletcher
I'm attempting to perform a map on a Dataset[Row] but getting an error on
decode when attempting to pass a custom encoder.
 My code looks similar to the following:


val source =
spark.read.format("parquet").load("/emrdata/sources/very_large_ds")



source.map{ row => {
  val key = row(0)

   }
}
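
One hedged alternative to the kryo (Any, ArrayBuffer[Row]) encoder is to project
onto concrete types first so Spark's built-in encoders apply; the column names
id/payload below are guesses, not the real schema:

import spark.implicits._

val source = spark.read.format("parquet").load("/emrdata/sources/very_large_ds")

// project the untyped rows onto concrete column types up front; the tuple encoder
// is then derived automatically and no kryo/Row encoder is required
val keyed = source
  .select($"id".cast("long").as("key"), $"payload".cast("string").as("value"))
  .as[(Long, String)]
  .map { case (key, value) => (key, Seq(value)) }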


pyspark in intellij

2017-02-25 Thread Stephen Boesch
Anyone have this working - either in 1.X or 2.X?

thanks


Re: Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
For now I have added to the log4j.properties:

log4j.logger.org.apache.parquet=ERROR

2017-02-18 11:50 GMT-08:00 Stephen Boesch <java...@gmail.com>:

> The following JIRA mentions that a fix made to read parquet 1.6.2 into 2.X  
> STILL leaves an "avalanche" of warnings:
>
>
> https://issues.apache.org/jira/browse/SPARK-17993
>
> Here is the text inside one of the last comments before it was merged:
>
>   I have built the code from the PR and it indeed succeeds reading the data.
>   I have tried doing df.count() and now I'm swarmed with warnings like this 
> (they are just keep getting printed endlessly in the terminal):
>   16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics because 
> created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
>   org.apache.parquet.VersionParser$VersionParseException: Could not parse 
> created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) 
> )?\(build ?(.*)\)
> at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
> at 
> org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
> at 
> org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
> at
>
> I am running 2.1.0 release and there are multitudes of these warnings. Is 
> there any way - short of changing logging level to ERROR - to suppress these?
>
>
>


Avalance of warnings trying to read Spark 1.6.X Parquet into Spark 2.X

2017-02-18 Thread Stephen Boesch
The following JIRA mentions that a fix made to read parquet 1.6.2 into
2.X  STILL leaves an "avalanche" of warnings:


https://issues.apache.org/jira/browse/SPARK-17993

Here is the text inside one of the last comments before it was merged:

  I have built the code from the PR and it indeed succeeds reading the data.
  I have tried doing df.count() and now I'm swarmed with warnings like
this (they are just keep getting printed endlessly in the terminal):
  16/08/11 12:18:51 WARN CorruptStatistics: Ignoring statistics
because created_by could not be parsed (see PARQUET-251): parquet-mr
version 1.6.0
  org.apache.parquet.VersionParser$VersionParseException: Could not
parse created_by: parquet-mr version 1.6.0 using format: (.+) version
((.*) )?\(build ?(.*)\)
at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
at 
org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetMetadata(ParquetMetadataConverter.java:567)
at 
org.apache.parquet.format.converter.ParquetMetadataConverter.readParquetMetadata(ParquetMetadataConverter.java:544)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:431)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:386)
at 
org.apache.spark.sql.execution.datasources.parquet.SpecificParquetRecordReaderBase.initialize(SpecificParquetRecordReaderBase.java:107)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.initialize(VectorizedParquetRecordReader.java:109)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:369)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$buildReader$1.apply(ParquetFileFormat.scala:343)
at

I am running 2.1.0 release and there are multitudes of these warnings.
Is there any way - short of changing logging level to ERROR - to
suppress these?


Re: Spark/Mesos with GPU support

2016-12-30 Thread Stephen Boesch
Would it be possible to share that communication?  I am interested in this
thread.

2016-12-30 11:02 GMT-08:00 Ji Yan :

> Thanks Michael, Tim and I have touched base and thankfully the issue has
> already been resolved
>
> On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt 
> wrote:
>
>> I've cc'd Tim and Kevin, who worked on GPU support.
>>
>> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan  wrote:
>>
>>> Dear Spark Users,
>>>
>>> Has anyone had successful experience running Spark on Mesos with GPU
>>> support? We have a Mesos cluster that can see and offer nvidia GPU
>>> resources. With Spark, it seems that the GPU support with Mesos (
>>> https://github.com/apache/spark/pull/14644) has only recently been
>>> merged into Spark Master which is not found in 2.0.2 release yet. We have a
>>> custom built Spark from 2.1-rc5 which is confirmed to have the above
>>> change. However when we try to run any code from Spark on this Mesos setup,
>>> the spark program hangs and keeps saying
>>>
>>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>>> check your cluster UI to ensure that workers are registered and have
>>> sufficient resources”
>>>
>>> We are pretty sure that the cluster has enough resources as there is
>>> nothing running on it. If we disable the GPU support in configuration and
>>> restart mesos and retry the same program, it would work.
>>>
>>> Any comment/advice on this greatly appreciated
>>>
>>> Thanks,
>>> Ji
>>>
>>>
>>> The information in this email is confidential and may be legally
>>> privileged. It is intended solely for the addressee. Access to this email
>>> by anyone else is unauthorized. If you are not the intended recipient, any
>>> disclosure, copying, distribution or any action taken or omitted to be
>>> taken in reliance on it, is prohibited and may be unlawful.
>>>
>>
>>
>>
>> --
>> Michael Gummelt
>> Software Engineer
>> Mesosphere
>>
>
>
> The information in this email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful.
>


Re: Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
This problem appears to be a regression on HEAD/master:  when running
against 2.0.2 the pyspark job completes successfully including running
predictions.

2016-11-23 19:36 GMT-08:00 Stephen Boesch <java...@gmail.com>:

>
> For a pyspark job with 54 executors all of the task outputs have a single
> line in both the stderr and stdout similar to:
>
> Error: invalid log directory 
> /shared/sparkmaven/work/app-20161119222540-/0/
>
>
> Note: the directory /shared/sparkmaven/work exists and is owned by the
> same user running the job. There are plenty of other app-*** subdirectories
> that do have contents in the stdout/stderr files.
>
>
> $ls -lrta  /shared/sparkmaven/work
> total 0
> drwxr-xr-x  59 steve  staff  2006 Nov 23 05:01 ..
> drwxr-xr-x  41 steve  staff  1394 Nov 23 18:22 app-20161123050122-0002
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:22 app-20161123182031-0005
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:44 app-20161123184349-0006
> drwxr-xr-x   6 steve  staff   204 Nov 23 18:46 app-20161123184613-0007
> drwxr-xr-x   3 steve  staff   102 Nov 23 19:20 app-20161123192048-0008
>
>
>
> Here is a sample of the contents
>
> /shared/sparkmaven/work/app-20161123184613-0007/2:
> total 16
> -rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
> drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
> -rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
> drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
>
> /shared/sparkmaven/work/app-20161123184613-0007/3:
> total 16
> -rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
> drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
> drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
> -rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
>
>
> Note also:  the *SparkPI* program does run succesfully - which validates
> the basic spark installation/functionality.
>
>


Invalid log directory running pyspark job

2016-11-23 Thread Stephen Boesch
For a pyspark job with 54 executors all of the task outputs have a single
line in both the stderr and stdout similar to:

Error: invalid log directory /shared/sparkmaven/work/app-20161119222540-/0/


Note: the directory /shared/sparkmaven/work exists and is owned by the same
user running the job. There are plenty of other app-*** subdirectories that
do have contents in the stdout/stderr files.


$ls -lrta  /shared/sparkmaven/work
total 0
drwxr-xr-x  59 steve  staff  2006 Nov 23 05:01 ..
drwxr-xr-x  41 steve  staff  1394 Nov 23 18:22 app-20161123050122-0002
drwxr-xr-x   6 steve  staff   204 Nov 23 18:22 app-20161123182031-0005
drwxr-xr-x   6 steve  staff   204 Nov 23 18:44 app-20161123184349-0006
drwxr-xr-x   6 steve  staff   204 Nov 23 18:46 app-20161123184613-0007
drwxr-xr-x   3 steve  staff   102 Nov 23 19:20 app-20161123192048-0008



Here is a sample of the contents

/shared/sparkmaven/work/app-20161123184613-0007/2:
total 16
-rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
-rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr
drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..

/shared/sparkmaven/work/app-20161123184613-0007/3:
total 16
-rw-r--r--  1 steve  staff 0 Nov 23 18:46 stdout
drwxr-xr-x  6 steve  staff   204 Nov 23 18:46 ..
drwxr-xr-x  4 steve  staff   136 Nov 23 18:46 .
-rw-r--r--  1 steve  staff  4830 Nov 23 18:46 stderr


Note also: the *SparkPi* program does run successfully - which validates
the basic Spark installation/functionality.


Re: HPC with Spark? Simultaneous, parallel one to one mapping of partition to vcore

2016-11-19 Thread Stephen Boesch
While your proposed N partitions may "apparently" saturate the N available
workers, the "actual" distribution of tasks to workers is controlled by the
scheduler. In my experience you can *not* trust the default Fair Scheduler
to guarantee round-robin scheduling of the tasks: you may well end up with
tasks being queued.

The suggestion is to try it out on the resource manager and scheduler being
used for your deployment. You may need to swap out their default scheduler
for a true round robin.
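
A minimal sketch of one way to check this empirically, assuming a spark-shell
session with an existing sc; the partition count and the timing check are
illustrative only:

import org.apache.spark.TaskContext

// one row per partition, one partition per expected vcore (illustrative)
val n = 8
val rdd = sc.parallelize(1 to n, n)

// record when each task actually starts; a wide spread of start times
// means tasks were queued rather than launched simultaneously
val startTimes = rdd.mapPartitions { _ =>
  Iterator((TaskContext.get().partitionId(), System.currentTimeMillis()))
}.collect()

println(startTimes.sortBy(_._2).mkString("\n"))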

2016-11-19 16:44 GMT-08:00 Adam Smith :

> Dear community,
>
> I have an RDD with N rows and N partitions. I want to ensure that the
> partitions all run at the same time, by setting the number of vcores
> (spark-yarn) to N. The partitions need to talk to each other with some
> socket-based sync, which is why I need them to run more or less
> simultaneously.
>
> Let's assume no node will die. Will my setup guarantee that all partitions
> are computed in parallel?
>
> I know this is somewhat hackish. Is there a better way of doing this?
>
> My goal is to replicate message passing (like OpenMPI) with Spark, where I
> have very specific and final communication requirements. So there is no need
> for the many comm and sync features, just what I already have - sync and talk.
>
> Thanks!
> Adam
>
>


Spark-packages

2016-11-06 Thread Stephen Boesch
What is the state of the spark-packages project(s)? When running a query
for machine learning algorithms, the results are not encouraging.


https://spark-packages.org/?q=tags%3A%22Machine%20Learning%22

There are 62 packages. Only a few have actual releases - and even fewer with
dates in the past twelve months.

There are several from DataBricks among the chosen few that have recent
releases.

Here is one that actually seems to be in reasonable shape: the DB port of
Stanford coreNLP.

https://github.com/databricks/spark-corenlp

But .. one or two solid packages .. ?

The spark-packages approach seems not to have picked up steam. Are there
other suggested places to look for algorithms not included in mllib itself?


Re: Use BLAS object for matrix operation

2016-11-03 Thread Stephen Boesch
It is declared private to Spark. You will need to put your code in that same
package, or create an accessor to it that lives within that package:

private[spark]
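
A minimal sketch of such an accessor, assuming you are willing to compile a
small shim into Spark's package namespace; the object name BLASAccessor is
illustrative and not part of Spark:

// lives in your own source tree, but is declared in Spark's package so it
// can see the private[spark] BLAS object
package org.apache.spark.ml.linalg

object BLASAccessor {
  def dot(x: Vector, y: Vector): Double = BLAS.dot(x, y)
}

Application code (or the shell, with this shim on the classpath) can then call
BLASAccessor.dot(a, b) instead of BLAS.dot(a, b).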


2016-11-03 16:04 GMT-07:00 Yanwei Zhang :

> I would like to use some matrix operations in the BLAS object defined in
> ml.linalg. But for some reason, spark shell complains it cannot locate this
> object. I have constructed an example below to illustrate the issue. Please
> advise how to fix this. Thanks .
>
>
>
> import org.apache.spark.ml.linalg.{BLAS, DenseVector, Vector, Vectors}
>
>
> val a = Vectors.dense(1.0, 2.0)
> val b = Vectors.dense(2.0, 3.0)
> BLAS.dot(a, b)
>
>
> :42: error: not found: value BLAS
>
>
>


Re: Aggregation Calculation

2016-11-03 Thread Stephen Boesch
You would likely want to create inline views that perform the filtering
*before* performing the cubes/rollups; that way the cubes/rollups only
operate on the pruned rows/columns.
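
A minimal sketch of that approach with the DataFrame API; df and the column
names are illustrative only:

import org.apache.spark.sql.functions._

// inline view: prune the rows and columns you will never need ...
val pruned = df.filter(col("r1").isNotNull).select("r1", "r2", "value")

// ... then run the cube only over the pruned data
val result = pruned.cube("r1", "r2").agg(sum("value"))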

2016-11-03 11:29 GMT-07:00 Andrés Ivaldi :

> Hello, I need to perform some aggregations and a kind of Cube/RollUp
> calculation
>
> Doing some tests, it looks like Cube and RollUp perform aggregation over all
> possible column combinations, but I just need some specific column
> combinations.
>
> What I'm trying to do is like a data table where the first N columns are my
> rows and the next M values are my columns and the last columns are the
> aggregated values, like Dimension / Measures.
>
> I need all the values of the N and M columns and the ones that correspond
> to the aggregation function. I'll never need the rows where a previous
> column has no value, i.e.
>
> having N=2 so two columns as rows I'll need
> R1 | R2  
> ##  |  ## 
> ##  |   null 
>
> but not
> null | ## 
>
> as roll up does, same approach to M columns
>
>
> So the question is: what would be the better way to perform this
> calculation?
> Using rollUp/Cube gives me a lot of values that I don't need.
> Using groupBy gives me less information (I could do several groupBys, but
> that is not performant, I think).
> Is there any other way to do something like that?
>
> Thanks.
>
>
>
>
>
> --
> Ing. Ivaldi Andres
>


DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn

2016-10-12 Thread Stephen Hankinson
Hi,

We have a cluster running Apache Spark 2.0 on Hadoop 2.7.2, CentOS 7.2. We
wrote some new code using the Spark DataFrame/Dataset APIs but are noticing
incorrect results on a join after writing data to and then reading it back
from Windows Azure Storage Blobs (the default HDFS location). I've been able
to reproduce the issue with the following snippet of code running on the
cluster.

case class UserDimensions(user: Long, dimension: Long, score: Double)
case class CentroidClusterScore(dimension: Long, cluster: Int, score: Double)

val dims = sc.parallelize(Array(UserDimensions(12345, 0, 1.0))).toDS
val cent = sc.parallelize(Array(CentroidClusterScore(0, 1,
1.0),CentroidClusterScore(1, 0, 1.0),CentroidClusterScore(2, 2,
1.0))).toDS

dims.show
cent.show
dims.join(cent, dims("dimension") === cent("dimension") ).show

outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|        0|      1|  1.0|
+-----+---------+-----+---------+-------+-----+

which is correct. However, after writing and reading the data, we see this:

dims.write.mode("overwrite").save("/tmp/dims2.parquet")
cent.write.mode("overwrite").save("/tmp/cent2.parquet")

val dims2 = spark.read.load("/tmp/dims2.parquet").as[UserDimensions]
val cent2 = spark.read.load("/tmp/cent2.parquet").as[CentroidClusterScore]

dims2.show
cent2.show

dims2.join(cent2, dims2("dimension") === cent2("dimension") ).show

outputs

+-----+---------+-----+
| user|dimension|score|
+-----+---------+-----+
|12345|        0|  1.0|
+-----+---------+-----+

+---------+-------+-----+
|dimension|cluster|score|
+---------+-------+-----+
|        0|      1|  1.0|
|        1|      0|  1.0|
|        2|      2|  1.0|
+---------+-------+-----+

+-----+---------+-----+---------+-------+-----+
| user|dimension|score|dimension|cluster|score|
+-----+---------+-----+---------+-------+-----+
|12345|        0|  1.0|     null|   null| null|
+-----+---------+-----+---------+-------+-----+

However, using the RDD API produces the correct result

dims2.rdd.map( row => (row.dimension, row) ).join( cent2.rdd.map( row
=> (row.dimension, row) ) ).take(5)

res5: Array[(Long, (UserDimensions, CentroidClusterScore))] =
Array((0,(UserDimensions(12345,0,1.0),CentroidClusterScore(0,1,1.0))))

We've tried changing the output format to ORC instead of Parquet, but we see
the same results. Running Spark 2.0 locally, not on a cluster, does not have
this issue, and running Spark in local mode on the master node of the Hadoop
cluster also works. Only when running on top of YARN do we see this issue.

This also seems very similar to this issue:
https://issues.apache.org/jira/browse/SPARK-10896
Thoughts?


*Stephen Hankinson*


How to detect when a JavaSparkContext gets stopped

2016-09-05 Thread Hough, Stephen C
I have a long-running application, configured to be HA, whereby only the
designated leader acquires a JavaSparkContext, listens for requests and pushes
jobs onto this context.

The problem I have is that whenever the AWS instances running my workers die
(either because a time-to-live expires or because I cancel those instances),
Spark seems to blame my driver, and I see the following in the logs.

org.apache.spark.SparkException: Exiting due to error from cluster scheduler: 
Master removed our application: FAILED

However, my application doesn't get a notification, so it thinks everything is
okay until it receives another request, tries to submit it to Spark, and gets a

java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.

Is there a way I can observe when the JavaSparkContext I own is stopped?

Thanks
Stephen
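
A minimal sketch of two ways to observe this, assuming jsc is the long-lived
JavaSparkContext; the callback bodies are illustrative only:

import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd}

// option 1: register a listener and react when the application ends
jsc.sc.addSparkListener(new SparkListener {
  override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit = {
    // notify the HA leader here so it can rebuild its context
  }
})

// option 2: poll before submitting new work
if (jsc.sc.isStopped) {
  // rebuild the JavaSparkContext before accepting the request
}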


DataFrame equivalent to RDD.partitionByKey

2016-08-09 Thread Stephen Fletcher
Is there a DataFrameReader equivalent to the RDD's partitionByKey? I'm reading
data from a file data source, and I want the data I read in to be partitioned
the same way as the data I'm processing through a Spark Streaming RDD in the
same process.
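
A minimal sketch of the closest DataFrame-side approach, assuming hash
partitioning on a key column is acceptable; the path and the "key" column
name are illustrative only:

import org.apache.spark.sql.functions.col

// read the file-based source, then hash-partition the result by the key
// column so it lines up with data partitioned the same way elsewhere
val df = spark.read
  .parquet("/path/to/data")
  .repartition(col("key"))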


Re: Logging trait in Spark 2.0

2016-06-28 Thread Stephen Boesch
I also did not understand why the Logging trait was made private in Spark
2.0. In a couple of projects, including CaffeOnSpark, the Logging trait was
simply copied into the new project to preserve backwards compatibility.
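
A minimal sketch of the slf4j approach suggested below, assuming a custom
Spark Streaming receiver; the class name and log messages are illustrative
only:

import org.slf4j.LoggerFactory
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  // plain slf4j logger instead of the now-private Spark Logging trait
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  override def onStart(): Unit = log.info("custom receiver starting")
  override def onStop(): Unit = log.info("custom receiver stopping")
}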

2016-06-28 18:10 GMT-07:00 Michael Armbrust :

> I'd suggest using the slf4j APIs directly.  They provide a nice stable API
> that works with a variety of logging backends.  This is what Spark does
> internally.
>
> On Sun, Jun 26, 2016 at 4:02 AM, Paolo Patierno 
> wrote:
>
>> Yes ... the same here ... I'd like to know the best way for adding
>> logging in a custom receiver for Spark Streaming 2.0
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>>
>> --
>> From: jonathaka...@gmail.com
>> Date: Fri, 24 Jun 2016 20:56:40 +
>> Subject: Re: Logging trait in Spark 2.0
>> To: yuzhih...@gmail.com; ppatie...@live.com
>> CC: user@spark.apache.org
>>
>>
>> Ted, how is that thread related to Paolo's question?
>>
>> On Fri, Jun 24, 2016 at 1:50 PM Ted Yu  wrote:
>>
>> See this related thread:
>>
>>
>> http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+
>>
>> On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno 
>> wrote:
>>
>> Hi,
>>
>> developing a Spark Streaming custom receiver I noticed that the Logging
>> trait isn't accessible anymore in Spark 2.0.
>>
>> trait Logging in package internal cannot be accessed in package
>> org.apache.spark.internal
>>
>> For developing a custom receiver what is the preferred way for logging ?
>> Just using log4j dependency as any other Java/Scala library/application ?
>>
>> Thanks,
>> Paolo
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
>> Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>>
>>
>


Custom Optimizer

2016-06-23 Thread Stephen Boesch
My team has a custom optimization routine that we would like to plug in as a
replacement for the default LBFGS / OWLQN optimizers used by some of the
ml/mllib algorithms.

However, it seems the choice of optimizer is hard-coded in every algorithm
except LDA, and even there it is only a choice between the internally defined
online and batch versions.

Any suggestions on how we might be able to incorporate our own optimizer?
Or do we need to roll all of our algorithms from top to bottom - basically
sidestepping ml/mllib?

thanks
stephen


Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
I just checked out a completely fresh directory and created a new IntelliJ
project, then followed your tip for adding the avro source.

Here is an additional set of errors

Error:(31, 12) object razorvine is not a member of package net
import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler}
   ^
Error:(779, 49) not found: type IObjectPickler
  class PythonMessageAndMetadataPickler extends IObjectPickler {
^
Error:(783, 7) not found: value Pickler
  Pickler.registerCustomPickler(classOf[PythonMessageAndMetadata], this)
  ^
Error:(784, 7) not found: value Pickler
  Pickler.registerCustomPickler(this.getClass, this)
  ^
Error:(787, 57) not found: type Pickler
def pickle(obj: Object, out: OutputStream, pickler: Pickler) {
^
Error:(789, 19) not found: value Opcodes
out.write(Opcodes.GLOBAL)
  ^
Error:(794, 19) not found: value Opcodes
out.write(Opcodes.MARK)
  ^
Error:(800, 19) not found: value Opcodes
out.write(Opcodes.TUPLE)
  ^
Error:(801, 19) not found: value Opcodes
out.write(Opcodes.REDUCE)
  ^

2016-06-22 23:49 GMT-07:00 Stephen Boesch <java...@gmail.com>:

> Thanks Jeff - I remember that now from long time ago.  After making that
> change the next errors are:
>
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term fasterxml in package com,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term jackson in value com.fasterxml,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.fasterxml.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term annotation in value com.jackson,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.jackson.
> Error:scalac: missing or invalid dependency detected while loading class
> file 'RDDOperationScope.class'.
> Could not access term JsonInclude in value com.annotation,
> because it (or its dependencies) are missing. Check your build definition
> for
> missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
> the problematic classpath.)
> A full rebuild may help if 'RDDOperationScope.class' was compiled against
> an incompatible version of com.annotation.
> Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
> continuing with a stub.
>
>
> 2016-06-22 23:39 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:
>
>> You need to
>> spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
>> under build path, this is the only thing you need to do manually if I
>> remember correctly.
>>
>>
>>
>> On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com>
>> wrote:
>>
>>> Hi Jeff,
>>>   I'd like to understand what may be different. I have rebuilt and
>>> reimported many times.  Just now I blew away the .idea/* and *.iml to start
>>> from scratch.  I just opened the $SPARK_HOME directory from intellij File |
>>> Open  .  After it finished the initial import I tried to run one of the
>>> Examples - and it fails in the build:
>>>
>>> Here are the errors I see:
>>>
>>> Error:(45, 66) not found: type SparkFlumeProtocol
>>>   val transactionTimeout: Int, val backOffInterval: Int) extends
>>> SparkFlumeProtocol with Logging {
>>>  ^
>>> Error:(70, 39) not found: type EventBatch
>>>   override def getEventBatch(n: Int): EventBatch = {
>>>   ^
>>> Error:(85, 13) not found: type EventBatch
>>> new EventBatch("Spark sink has be

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Thanks Jeff - I remember that now from a long time ago. After making that
change, the next errors are:

Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term fasterxml in package com,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term jackson in value com.fasterxml,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.fasterxml.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term annotation in value com.jackson,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.jackson.
Error:scalac: missing or invalid dependency detected while loading class
file 'RDDOperationScope.class'.
Could not access term JsonInclude in value com.annotation,
because it (or its dependencies) are missing. Check your build definition
for
missing or conflicting dependencies. (Re-run with `-Ylog-classpath` to see
the problematic classpath.)
A full rebuild may help if 'RDDOperationScope.class' was compiled against
an incompatible version of com.annotation.
Warning:scalac: Class org.jboss.netty.channel.ChannelFactory not found -
continuing with a stub.


2016-06-22 23:39 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:

> You need to
> spark/external/flume-sink/target/scala-2.11/src_managed/main/compiled_avro
> under build path, this is the only thing you need to do manually if I
> remember correctly.
>
>
>
> On Thu, Jun 23, 2016 at 2:30 PM, Stephen Boesch <java...@gmail.com> wrote:
>
>> Hi Jeff,
>>   I'd like to understand what may be different. I have rebuilt and
>> reimported many times.  Just now I blew away the .idea/* and *.iml to start
>> from scratch.  I just opened the $SPARK_HOME directory from intellij File |
>> Open  .  After it finished the initial import I tried to run one of the
>> Examples - and it fails in the build:
>>
>> Here are the errors I see:
>>
>> Error:(45, 66) not found: type SparkFlumeProtocol
>>   val transactionTimeout: Int, val backOffInterval: Int) extends
>> SparkFlumeProtocol with Logging {
>>  ^
>> Error:(70, 39) not found: type EventBatch
>>   override def getEventBatch(n: Int): EventBatch = {
>>   ^
>> Error:(85, 13) not found: type EventBatch
>> new EventBatch("Spark sink has been stopped!", "",
>> java.util.Collections.emptyList())
>> ^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
>> Error:(80, 22) not found: type EventBatch
>>   def getEventBatch: EventBatch = {
>>  ^
>> Error:(48, 37) not found: type EventBatch
>>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
>> Error", "",
>> ^
>> Error:(48, 54) not found: type EventBatch
>>   @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
>> Error", "",
>>  ^
>> Error:(115, 41) not found: type SparkSinkEvent
>> val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
>> ^
>> Error:(146, 28) not found: type EventBatch
>>   eventBatch = new EventBatch("", seqNum, events)
>>^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
>> Error:(25, 27) not found: type EventBatch
>>   def isErrorBatch(batch: EventBatch): Boolean = {
>>   ^
>>
>> /git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
>> Error:(86, 51) not found: type SparkFlumeProtocol
>> val responder

Re: Building Spark 2.X in Intellij

2016-06-23 Thread Stephen Boesch
Hi Jeff,
  I'd like to understand what may be different. I have rebuilt and
reimported many times. Just now I blew away the .idea/* and *.iml files to
start from scratch, and opened the $SPARK_HOME directory from IntelliJ via
File | Open. After it finished the initial import I tried to run one of the
examples - and it fails in the build:

Here are the errors I see:

Error:(45, 66) not found: type SparkFlumeProtocol
  val transactionTimeout: Int, val backOffInterval: Int) extends
SparkFlumeProtocol with Logging {
 ^
Error:(70, 39) not found: type EventBatch
  override def getEventBatch(n: Int): EventBatch = {
  ^
Error:(85, 13) not found: type EventBatch
new EventBatch("Spark sink has been stopped!", "",
java.util.Collections.emptyList())
^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/TransactionProcessor.scala
Error:(80, 22) not found: type EventBatch
  def getEventBatch: EventBatch = {
 ^
Error:(48, 37) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
^
Error:(48, 54) not found: type EventBatch
  @volatile private var eventBatch: EventBatch = new EventBatch("Unknown
Error", "",
 ^
Error:(115, 41) not found: type SparkSinkEvent
val events = new util.ArrayList[SparkSinkEvent](maxBatchSize)
^
Error:(146, 28) not found: type EventBatch
  eventBatch = new EventBatch("", seqNum, events)
   ^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSinkUtils.scala
Error:(25, 27) not found: type EventBatch
  def isErrorBatch(batch: EventBatch): Boolean = {
  ^
/git/spark/external/flume-sink/src/main/scala/org/apache/spark/streaming/flume/sink/SparkSink.scala
Error:(86, 51) not found: type SparkFlumeProtocol
val responder = new SpecificResponder(classOf[SparkFlumeProtocol],
handler.get)
  ^


Note: this is just the first batch of errors.




2016-06-22 20:50 GMT-07:00 Jeff Zhang <zjf...@gmail.com>:

> It works well with me. You can try reimport it into intellij.
>
> On Thu, Jun 23, 2016 at 10:25 AM, Stephen Boesch <java...@gmail.com>
> wrote:
>
>>
>> Building inside intellij is an ever moving target. Anyone have the
>> magical procedures to get it going for 2.X?
>>
>> There are numerous library references that - although included in the
>> pom.xml build - are for some reason not found when processed within
>> Intellij.
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Building Spark 2.X in Intellij

2016-06-22 Thread Stephen Boesch
Building inside IntelliJ is an ever-moving target. Does anyone have the
magical procedure to get it going for 2.X?

There are numerous library references that - although included in the
pom.xml build - are for some reason not found when processed within
IntelliJ.


Notebook(s) for Spark 2.0 ?

2016-06-20 Thread Stephen Boesch
Having looked closely at Jupyter, Zeppelin, and Spark Notebook: only the
latter seems close to having support for Spark 2.X.

While I am interested in using Spark Notebook as soon as that support is
available, are there alternatives that work *now*? For example, some
unmerged-yet-working fork(s)?

thanks


Data Generators mllib -> ml

2016-06-20 Thread Stephen Boesch
There are around twenty data generators in mllib - none of which have been
migrated to ml yet.

Here is an example

/**
 * :: DeveloperApi ::
 * Generate sample data used for SVM. This class generates uniform random values
 * for the features and adds Gaussian noise with weight 0.1 to generate labels.
 */
@DeveloperApi
@Since("0.8.0")
object SVMDataGenerator {


Will these be migrated - and is there any indication of a timeframe?
My intention would be to publicize these as non-deprecated for the 2.0
and 2.1 timeframes.


Re: Python to Scala

2016-06-17 Thread Stephen Boesch
What are you expecting us to do? Yash provided a reasonable approach,
based on the info you had provided in prior emails. Otherwise you can
convert it from Python to Spark yourself - or find someone else who feels
comfortable doing it. That kind of inquiry would likely be appropriate on a
job board.



2016-06-17 21:47 GMT-07:00 Aakash Basu :

> Hey,
>
> Our complete project is Spark on Scala; I code in Scala for Spark -
> though I am new, I know it and am still learning. But I need help in
> converting this code to Scala. I have nearly no knowledge of Python, hence
> my request to the experts here.
>
> Hope you get me now.
>
> Thanks,
> Aakash.
> On 18-Jun-2016 10:07 AM, "Yash Sharma"  wrote:
>
>> You could use pyspark to run the python code on spark directly. That will
>> cut the effort of learning scala.
>>
>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html
>>
>> - Thanks, via mobile,  excuse brevity.
>> On Jun 18, 2016 2:34 PM, "Aakash Basu"  wrote:
>>
>>> Hi all,
>>>
>>> I have some Python code which I want to convert to Scala to use in a
>>> Spark program. I'm not well acquainted with Python and am learning Scala
>>> now. Any Python+Scala experts here? Can someone please help me out with this?
>>>
>>> Thanks & Regards,
>>> Aakash.
>>>
>>

