Re: Spark master build hangs using parallel build option in maven

2020-01-17 Thread Dongjoon Hyun
Hi, Saurabh.

It seems that you are hitting
https://issues.apache.org/jira/browse/SPARK-26095 .

And, we disabled the parallel build via
https://github.com/apache/spark/pull/23061 at 3.0.0.

According to the stack trace in the JIRA and the PR description,
`maven-shade-plugin` seems to be the root cause.

For now, I'd recommend disabling the parallel build, since `Maven` itself
already warns you. (You know that, right?)

> [INFO] [ pom ]
> [WARNING] *****************************************************************
> [WARNING] * Your build is requesting parallel execution, but project      *
> [WARNING] * contains the following plugin(s) that have goals not marked   *
> [WARNING] * as @threadSafe to support parallel building.                  *
> [WARNING] * While this /may/ work fine, please look for plugin updates    *
> [WARNING] * and/or request plugins be made thread-safe.                   *
> [WARNING] * If reporting an issue, report it against the plugin in        *
> [WARNING] * question, not against maven-core                              *
> [WARNING] *****************************************************************
> [WARNING] The following plugins are not marked @threadSafe in Spark
> Project Parent POM:
> [WARNING] org.scalatest:scalatest-maven-plugin:2.0.0
> [WARNING] Enable debug to see more precisely which goals are not marked
> @threadSafe.
> [WARNING] *****************************************************************


I respect `Maven` warnings.

Bests,
Dongjoon.



Re: Spark master build hangs using parallel build option in maven

2020-01-17 Thread Saurabh Chawla
Hi Sean,

Thanks for checking this.

I can see the parallel-build info in the README file
https://github.com/apache/spark#building-spark

"
You can build Spark using more than one thread by using the -T option with
Maven, see "Parallel builds in Maven 3".
More detailed documentation is available from the project site, at
"Building Spark".
"

This used to work when building older versions of Spark (2.4.3, 2.3.2,
etc.):
build/mvn -Duse.zinc.server=false -DuseZincForJdk8=false
-Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver
-Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5
-DskipTests=true -T 4 clean package

I have also noticed that the Maven version changed from 3.5.4 to 3.6.3 on
the master branch compared to Spark 2.4.3.
I am not sure whether this is due to a bug in the Maven version used on
master or a new change on the master branch that prevents parallel builds.

Regards
Saurabh Chawla



Subscribe to spark-dev

2020-01-17 Thread Chandni Singh
Please add me to spark-dev mailing list.


Re: Spark master build hangs using parallel build option in maven

2020-01-17 Thread Sean Owen
Indeed, I don't believe you can use a parallel build; some things
collide with each other. Some of the suites already run in parallel
inside the build, though.


unsubscribe

2020-01-17 Thread Bruno S. de Barros


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Spark master build hangs using parallel build option in maven

2020-01-17 Thread Saurabh Chawla
Hi All,

The Spark master build hangs when using the parallel build option in
Maven. When running the build sequentially on Spark master using Maven,
the build did not hang. This issue occurs when the hadoop-provided option
(-Phadoop-provided -Dhadoop.version=2.8.5) is given. The same command
works fine to build spark-2.4.3 in parallel.

Command to build spark master sequentially - Spark build works fine:
build/mvn  -Duse.zinc.server=false -DuseZincForJdk8=false
-Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver
-Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5
-DskipTests=true  clean package

Command to build spark master in parallel - Spark build hangs:
build/mvn -X -Duse.zinc.server=false -DuseZincForJdk8=false
-Dmaven.javadoc.skip=true -DskipSource=true -Phive -Phive-thriftserver
-Phive-provided -Pyarn -Phadoop-provided -Dhadoop.version=2.8.5
-DskipTests=true -T 4 clean package

This is the trace that keeps repeating in the Maven logs:

[DEBUG] building maven31 dependency graph for
org.apache.spark:spark-network-yarn_2.12:jar:3.0.0-SNAPSHOT with
Maven31DependencyGraphBuilder
[DEBUG] Dependency collection stats: {ConflictMarker.analyzeTime=60583,
ConflictMarker.markTime=23750, ConflictMarker.nodeCount=419,
ConflictIdSorter.graphTime=41262, ConflictIdSorter.topsortTime=9704,
ConflictIdSorter.conflictIdCount=105,
ConflictIdSorter.conflictIdCycleCount=0, ConflictResolver.totalTime=632542,
ConflictResolver.conflictItemCount=193,
DefaultDependencyCollector.collectTime=1020759,
DefaultDependencyCollector.transformTime=775495}
[DEBUG] org.apache.spark:spark-network-yarn_2.12:jar:3.0.0-SNAPSHOT
[DEBUG]
 org.apache.spark:spark-network-shuffle_2.12:jar:3.0.0-SNAPSHOT:compile
[DEBUG]
org.apache.spark:spark-network-common_2.12:jar:3.0.0-SNAPSHOT:compile
[DEBUG]  io.netty:netty-all:jar:4.1.42.Final:compile (version
managed from 4.1.42.Final)
[DEBUG]  org.apache.commons:commons-lang3:jar:3.9:compile (version
managed from 3.9)
[DEBUG]  org.fusesource.leveldbjni:leveldbjni-all:jar:1.8:compile
(version managed from 1.8)
[DEBUG]
 com.fasterxml.jackson.core:jackson-databind:jar:2.10.0:compile (version
managed from 2.10.0)
[DEBUG]
com.fasterxml.jackson.core:jackson-core:jar:2.10.0:compile (version managed
from 2.10.0)
[DEBUG]
 com.fasterxml.jackson.core:jackson-annotations:jar:2.10.0:compile (version
managed from 2.10.0)
[DEBUG]  com.google.code.findbugs:jsr305:jar:3.0.0:compile (version
managed from 3.0.0)
[DEBUG]  com.google.guava:guava:jar:14.0.1:provided (scope managed
from compile) (version managed from 14.0.1)
[DEBUG]  org.apache.commons:commons-crypto:jar:1.0.0:compile
(version managed from 1.0.0) (exclusions managed from
[net.java.dev.jna:jna:*:*])
[DEBUG]   io.dropwizard.metrics:metrics-core:jar:4.1.1:compile (version
managed from 4.1.1)
[DEBUG]org.apache.spark:spark-tags_2.12:jar:3.0.0-SNAPSHOT:test
[DEBUG]   org.scala-lang:scala-library:jar:2.12.10:compile (version
managed from 2.12.10)
[DEBUG]org.apache.spark:spark-tags_2.12:jar:tests:3.0.0-SNAPSHOT:test
[DEBUG]org.apache.hadoop:hadoop-client:jar:2.8.5:provided (exclusions
managed from [org.fusesource.leveldbjni:leveldbjni-all:*:*, asm:asm:*:*,
org.codehaus.jackson:jackson-mapper-asl:*:*, org.ow2.asm:asm:*:*,
org.jboss.netty:netty:*:*, io.netty:netty:*:*,
commons-beanutils:commons-beanutils-core:*:*,
commons-logging:commons-logging:*:*, org.mockito:mockito-all:*:*,
org.mortbay.jetty:servlet-api-2.5:*:*, javax.servlet:servlet-api:*:*,
junit:junit:*:*, com.sun.jersey:*:*:*,
com.sun.jersey.jersey-test-framework:*:*:*, com.sun.jersey.contribs:*:*:*,
net.java.dev.jets3t:jets3t:*:*, javax.ws.rs:jsr311-api:*:*,
org.eclipse.jetty:jetty-webapp:*:*])
[DEBUG]   org.apache.hadoop:hadoop-common:jar:2.8.5:provided
[DEBUG]  com.hadoop.gplcompression:hadoop-lzo:jar:0.4.19:provided
[DEBUG]  commons-cli:commons-cli:jar:1.2:provided
[DEBUG]  org.apache.commons:commons-math3:jar:3.4.1:provided
(version managed from 3.1.1)
[DEBUG]  org.apache.httpcomponents:httpclient:jar:4.5.6:provided
(version managed from 4.5.2)
[DEBUG] org.apache.httpcomponents:httpcore:jar:4.4.12:provided
(version managed from 4.4.10)
[DEBUG]  commons-codec:commons-codec:jar:1.10:provided (version
managed from 1.11)
[DEBUG]  commons-io:commons-io:jar:2.4:provided (version managed
from 2.5)
[DEBUG]  commons-net:commons-net:jar:3.1:provided (version managed
from 3.6)
[DEBUG]  commons-collections:commons-collections:jar:3.2.2:provided
(version managed from 3.2.2)
[DEBUG]
 org.eclipse.jetty:jetty-servlet:jar:9.4.18.v20190429:provided (scope
managed from compile) (version managed from 9.3.19.v20170502)
[DEBUG]
org.eclipse.jetty:jetty-security:jar:9.4.18.v20190429:provided (scope
managed from compile) (version managed from 9.4.18.v20190429)
[DEBUG]  javax.servlet.jsp:jsp-api:jar:2.1:provided
[DEBUG]  log4j:log4j:jar:1.2.17:provided (scope managed from

Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Ryan Blue
We've implemented these metrics in the RDD (for input metrics) and in the
v2 DataWritingSparkTask. That approach gives you the same metrics in the
stage views that you get with v1 sources, regardless of the v2
implementation.

I'm not sure why they weren't included from the start. It looks like the
way metrics are collected is changing. There are a couple of metrics for
the number of rows; it looks like one goes to the Spark SQL tab and one
is used for the stages view.

If you'd like, I can send you a patch.

rb


-- 
Ryan Blue
Software Engineer
Netflix


Re: [Discuss] Metrics Support for DS V2

2020-01-17 Thread Wenchen Fan
I think there are a few details we need to discuss.

How frequently should a source update its metrics? For example, if the
file source needs to report size metrics per row, it will be super slow.

What metrics should a source report? Data size? numFiles? Read time?

Shall we show metrics in the SQL web UI as well?

On Fri, Jan 17, 2020 at 3:07 PM Sandeep Katta <
sandeep0102.opensou...@gmail.com> wrote:

> Hi Devs,
>
> Currently DS V2 does not update any input metrics. SPARK-30362 aims at
> solving this problem.
>
> We could take the approach below: have a marker interface, let's say
> "ReportMetrics".
>
> If the DataSource implements this interface, then it will be easy to
> collect the metrics.
>
> For example, FilePartitionReaderFactory can support metrics.
>
> So it will be easy to collect the metrics if FilePartitionReaderFactory
> implements ReportMetrics.
>
>
> Please let me know your views, or whether we want a new solution or
> design.
>
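
A minimal sketch of the marker-interface idea discussed above (the names
`ReportMetrics` and `currentMetrics` are illustrative, taken from this
thread rather than from any existing DSv2 API):

```scala
// Hypothetical marker interface a DSv2 reader component (e.g. a partition
// reader created by FilePartitionReaderFactory) could implement. Spark
// could then poll currentMetrics() periodically -- say per batch or per
// file -- instead of per row, which, as noted above, would be too slow.
trait ReportMetrics {
  // Metric names and values, e.g. Map("bytesRead" -> 1024L, "numFiles" -> 2L)
  def currentMetrics(): Map[String, Long]
}

// A toy reader showing how an implementation might accumulate the numbers.
class CountingReader extends ReportMetrics {
  private var bytesRead = 0L
  private var recordsRead = 0L

  def readRecord(record: Array[Byte]): Unit = {
    bytesRead += record.length
    recordsRead += 1
  }

  override def currentMetrics(): Map[String, Long] =
    Map("bytesRead" -> bytesRead, "recordsRead" -> recordsRead)
}
```

The marker-interface check would let the executor side collect metrics only
from sources that opt in, leaving other implementations untouched.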


unsubscribe

2020-01-17 Thread Pingxiao Ye



Re: How to implement a "saveAsBinaryFile" function?

2020-01-17 Thread Duan,Bing
Hi Fokko, Maxim, Long:

Thanks!

This reading occurs in a custom datasource, as below:

override def createRelation(…) {
  …
  blocks.map(block => block.bytes).saveAsTextFile(parameters("path"))
  ...
}

I am new to Spark and will try the methods you provided.

Best!

Bing.

On Jan 17, 2020, at 4:28 AM, Maxim Gekk <maxim.g...@databricks.com> wrote:

Hi Bing,

You can try Text datasource. It shouldn't modify strings:
scala> Seq(""""20192_1",1,24,0,2,"S66.000x001"""").toDS.write.text("tmp/text.txt")
$ cat tmp/text.txt/part-0-256d960f-9f85-47fe-8edd-8428276eb3c6-c000.txt
"20192_1",1,24,0,2,"S66.000x001"

Maxim Gekk
Software Engineer
Databricks B. V. 



On Thu, Jan 16, 2020 at 10:02 PM Long, Andrew <loand...@amazon.com.invalid> wrote:
Hey Bing,

There are a couple of different approaches you could take. The quickest and
easiest would be to use the existing APIs:

val bytes = spark.range(1000)

bytes.foreachPartition(bytes => {
  // WARNING: anything used in here will need to be serializable.
  // There's some magic to serializing the hadoop conf. See the hadoop
  // wrapper class in the source.
  val writer = FileSystem.get(new Configuration()).create(new Path("s3://..."))
  bytes.foreach(b => writer.write(b.toInt))
  writer.close()
})

The more complicated but prettier approach would be to implement a custom
datasource.
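
For completeness, here is a rough sketch of what a `saveAsBinaryFile`-style
helper could look like under the first approach. The helper name, the
per-partition file naming, and the output layout are all illustrative,
not an existing Spark API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.TaskContext
import org.apache.spark.rdd.RDD

// Hedged sketch: write each partition of an RDD[Array[Byte]] as one raw
// binary file, bypassing saveAsTextFile's string conversion and escaping.
def saveAsBinaryFile(rdd: RDD[Array[Byte]], outputDir: String): Unit = {
  rdd.foreachPartition { records =>
    // Everything in this closure runs on the executor and must be serializable.
    val conf = new Configuration()
    val dir = new Path(outputDir)
    val fs = FileSystem.get(dir.toUri, conf)
    val part = TaskContext.getPartitionId()
    val out = fs.create(new Path(dir, f"part-$part%05d.bin"))
    try {
      records.foreach(bytes => out.write(bytes)) // raw bytes, no quoting
    } finally {
      out.close()
    }
  }
}
```

Compared with the foreachPartition snippet above, this only adds
deterministic per-partition file names; a production version would also
need to handle task retries (e.g. write to a temporary path and commit),
which is what a custom datasource's commit protocol would give you.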

From: "Duan,Bing" <duanb...@baidu.com>
Date: Thursday, January 16, 2020 at 12:35 AM
To: "dev@spark.apache.org" <dev@spark.apache.org>
Subject: How to implement a "saveAsBinaryFile" function?

Hi all:

I read binary data (protobuf format) from the filesystem with the
binaryFiles function into an RDD[Array[Byte]], and it works fine. But when
I save it back to the filesystem with saveAsTextFile, the quotation marks
get escaped, like this:
"\"20192_1\"",1,24,0,2,"\"S66.000x001\"", which should be
"20192_1",1,24,0,2,"S66.000x001".

Could anyone give me a tip on how to implement a function like
saveAsBinaryFile to persist the RDD[Array[Byte]]?

Bests!

Bing