[jira] [Assigned] (BEAM-3658) Port SpannerIOReadTest off DoFnTester

2019-10-19 Thread Saikat Maitra (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-3658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saikat Maitra reassigned BEAM-3658:
---

Assignee: Saikat Maitra

> Port SpannerIOReadTest off DoFnTester
> -
>
> Key: BEAM-3658
> URL: https://issues.apache.org/jira/browse/BEAM-3658
> Project: Beam
>  Issue Type: Sub-task
>  Components: io-java-gcp
>Reporter: Kenneth Knowles
>Assignee: Saikat Maitra
>Priority: Major
>  Labels: beginner, newbie, starter
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-6766) Sort Merge Bucket Join support in Beam

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6766?focusedWorklogId=331018&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-331018
 ]

ASF GitHub Bot logged work on BEAM-6766:


Author: ASF GitHub Bot
Created on: 20/Oct/19 02:55
Start Date: 20/Oct/19 02:55
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #9251: [BEAM-6766] 
Implement SMB source
URL: https://github.com/apache/beam/pull/9251#issuecomment-544215560
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 331018)
Time Spent: 5.5h  (was: 5h 20m)

> Sort Merge Bucket Join support in Beam
> --
>
> Key: BEAM-6766
> URL: https://issues.apache.org/jira/browse/BEAM-6766
> Project: Beam
>  Issue Type: Improvement
>  Components: extensions-java-join-library, io-ideas
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> Design doc: 
> https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit#
> Hi! Spotify has been internally prototyping and testing an implementation of 
> the sort merge join using Beam primitives and we're interested in 
> contributing it open-source – probably to Beam's extensions package in its 
> own `smb` module or as part of the joins module?
> We've tested this with Avro files using Avro's GenericDatumWriter/Reader 
> directly (although this could theoretically be expanded to other 
> serialization formats). We'd add two transforms*, an SMB write and an SMB 
> join. 
> SMB write would take in one PCollection and a # of buckets and:
> 1) Apply a partitioning function to the input to assign each record to one 
> bucket. (the user code would have to statically specify a # of buckets... 
> hard to see a way to do this dynamically.)
> 2) Group by that bucket ID and within each bucket perform an in-memory sort 
> on join key. If the grouped records are too large to fit in memory, fall back 
> to an external sort (although if this happens, the user should probably increase 
> bucket size so every group fits in memory).
> 3) Directly write the contents of each bucket to a sequentially named file.
> 4) Write a metadata file to the same output path with info about hash 
> algorithm/# buckets.
> SMB join would take in the input paths for 2 or more Sources, all of which 
> are written in a bucketed and partitioned way, and:
> 1) Verify that the metadata files have compatible bucket # and hash algorithm.
> 2) Expand the input paths to enumerate the `ResourceIds` of every file in the 
> paths. Group all inputs with the same bucket ID.
> 3) Within each group, open a file reader on all `ResourceIds`. Sequentially 
> read files one record at a time, outputting tuples of all record pairs with 
> matching join key.
>  \* These could be implemented directly as `PTransforms` with the writer 
> being a `DoFn`, but semantically I do like the idea of extending 
> `FileBasedSource`/`Sink` with abstract classes like 
> `SortedBucketSink`/`SortedBucketSource`... if we represent the elements in a 
> sink as KV pairs of <bucket id, sorted records for that bucket>, so that the #
> of elements in the PCollection == # of buckets == # of output files, we could 
> just implement something like `SortedBucketSink` extending `FileBasedSink` 
> with a dynamic file naming function. I'd like to be able to take advantage of 
> the existing write/read implementation logic in the `io` package as much as 
> possible although I guess some of those are package private. 
> –
> From our internal testing, we've seen some substantial performance 
> improvements using the right bucket size--not only by avoiding a shuffle 
> during the join step, but also in storage costs, since we're getting better 
> compression in Avro by storing sorted records.
> Please let us know what you think/any concerns we can address! Our 
> implementation isn't quite production-ready yet, but we'd like to start a 
> discussion about it early.
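
As a rough illustration of the SMB write steps quoted above, here is a minimal Beam Java sketch. The transform names, the fixed bucket count, and the use of string-keyed records are assumptions for illustration only; this is not the API proposed in the linked PRs, and the actual file and metadata writing is elided.

{code}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.GroupByKey;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

final int numBuckets = 8;  // assumption: statically specified bucket count

PCollection<KV<String, String>> keyedRecords = ...;  // join key -> serialized record (placeholder)

keyedRecords
    // 1) Assign each record to a bucket by hashing its join key.
    .apply("AssignBucket", MapElements
        .into(TypeDescriptors.kvs(
            TypeDescriptors.integers(),
            TypeDescriptors.kvs(TypeDescriptors.strings(), TypeDescriptors.strings())))
        .via((KV<String, String> kv) ->
            KV.of(Math.floorMod(kv.getKey().hashCode(), numBuckets), kv)))
    // 2) Group by bucket id, then sort each bucket in memory by join key.
    .apply("GroupByBucket", GroupByKey.<Integer, KV<String, String>>create())
    .apply("SortAndWriteBucket", ParDo.of(
        new DoFn<KV<Integer, Iterable<KV<String, String>>>, Void>() {
          @ProcessElement
          public void process(ProcessContext c) {
            List<KV<String, String>> bucket = new ArrayList<>();
            c.element().getValue().forEach(bucket::add);
            bucket.sort(Comparator.comparing(KV::getKey));
            // 3) + 4) Write the sorted bucket to a sequentially named file and a
            // metadata file recording the hash function and bucket count
            // (file IO elided in this sketch; an external sort would replace the
            // in-memory sort for buckets that do not fit in memory).
          }
        }));
{code}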



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-6766) Sort Merge Bucket Join support in Beam

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-6766?focusedWorklogId=331019&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-331019
 ]

ASF GitHub Bot logged work on BEAM-6766:


Author: ASF GitHub Bot
Created on: 20/Oct/19 02:55
Start Date: 20/Oct/19 02:55
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #8824: [BEAM-6766] 
Implement SMB file operations
URL: https://github.com/apache/beam/pull/8824#issuecomment-544215565
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 331019)
Time Spent: 5h 40m  (was: 5.5h)

> Sort Merge Bucket Join support in Beam
> --
>
> Key: BEAM-6766
> URL: https://issues.apache.org/jira/browse/BEAM-6766
> Project: Beam
>  Issue Type: Improvement
>  Components: extensions-java-join-library, io-ideas
>Reporter: Claire McGinty
>Assignee: Claire McGinty
>Priority: Major
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> Design doc: 
> https://docs.google.com/document/d/1AQlonN8t4YJrARcWzepyP7mWHTxHAd6WIECwk1s3LQQ/edit#
> Hi! Spotify has been internally prototyping and testing an implementation of 
> the sort merge join using Beam primitives and we're interested in 
> contributing it open-source – probably to Beam's extensions package in its 
> own `smb` module or as part of the joins module?
> We've tested this with Avro files using Avro's GenericDatumWriter/Reader 
> directly (although this could theoretically be expanded to other 
> serialization formats). We'd add two transforms*, an SMB write and an SMB 
> join. 
> SMB write would take in one PCollection and a # of buckets and:
> 1) Apply a partitioning function to the input to assign each record to one 
> bucket. (the user code would have to statically specify a # of buckets... 
> hard to see a way to do this dynamically.)
> 2) Group by that bucket ID and within each bucket perform an in-memory sort 
> on join key. If the grouped records are too large to fit in memory, fall back 
> to an external sort (although if this happens, the user should probably increase 
> bucket size so every group fits in memory).
> 3) Directly write the contents of each bucket to a sequentially named file.
> 4) Write a metadata file to the same output path with info about hash 
> algorithm/# buckets.
> SMB join would take in the input paths for 2 or more Sources, all of which 
> are written in a bucketed and partitioned way, and:
> 1) Verify that the metadata files have compatible bucket # and hash algorithm.
> 2) Expand the input paths to enumerate the `ResourceIds` of every file in the 
> paths. Group all inputs with the same bucket ID.
> 3) Within each group, open a file reader on all `ResourceIds`. Sequentially 
> read files one record at a time, outputting tuples of all record pairs with 
> matching join key.
>  \* These could be implemented directly as `PTransforms` with the writer 
> being a `DoFn`, but semantically I do like the idea of extending 
> `FileBasedSource`/`Sink` with abstract classes like 
> `SortedBucketSink`/`SortedBucketSource`... if we represent the elements in a 
> sink as KV pairs of <bucket id, sorted records for that bucket>, so that the #
> of elements in the PCollection == # of buckets == # of output files, we could 
> just implement something like `SortedBucketSink` extending `FileBasedSink` 
> with a dynamic file naming function. I'd like to be able to take advantage of 
> the existing write/read implementation logic in the `io` package as much as 
> possible although I guess some of those are package private. 
> –
> From our internal testing, we've seen some substantial performance 
> improvements using the right bucket size--not only by avoiding a shuffle 
> during the join step, but also in storage costs, since we're getting better 
> compression in Avro by storing sorted records.
> Please let us know what you think/any concerns we can address! Our 
> implementation isn't quite production-ready yet, but we'd like to start a 
> discussion about it early.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-8042) Parsing of aggregate query fails

2019-10-19 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang reassigned BEAM-8042:
--

Assignee: (was: Rui Wang)

> Parsing of aggregate query fails
> 
>
> Key: BEAM-8042
> URL: https://issues.apache.org/jira/browse/BEAM-8042
> Project: Beam
>  Issue Type: Sub-task
>  Components: dsl-sql-zetasql
>Reporter: Rui Wang
>Priority: Critical
>
> {code}
>   @Rule
>   public TestPipeline pipeline = 
> TestPipeline.fromOptions(createPipelineOptions());
>   private static PipelineOptions createPipelineOptions() {
> BeamSqlPipelineOptions opts = 
> PipelineOptionsFactory.create().as(BeamSqlPipelineOptions.class);
> opts.setPlannerName(ZetaSQLQueryPlanner.class.getName());
> return opts;
>   }
>   @Test
>   public void testAggregate() {
> Schema inputSchema = Schema.builder()
> .addByteArrayField("id")
> .addInt64Field("has_f1")
> .addInt64Field("has_f2")
> .addInt64Field("has_f3")
> .addInt64Field("has_f4")
> .addInt64Field("has_f5")
> .addInt64Field("has_f6")
> .build();
> String sql = "SELECT \n" +
> "  id, \n" +
> "  COUNT(*) as count, \n" +
> "  SUM(has_f1) as f1_count, \n" +
> "  SUM(has_f2) as f2_count, \n" +
> "  SUM(has_f3) as f3_count, \n" +
> "  SUM(has_f4) as f4_count, \n" +
> "  SUM(has_f5) as f5_count, \n" +
> "  SUM(has_f6) as f6_count  \n" +
> "FROM PCOLLECTION \n" +
> "GROUP BY id";
> pipeline
> .apply(Create.empty(inputSchema))
> .apply(SqlTransform.query(sql));
> pipeline.run();
>   }
> {code}
> {code}
> Caused by: java.lang.RuntimeException: Error while applying rule 
> AggregateProjectMergeRule, args 
> [rel#553:LogicalAggregate.NONE(input=RelSubset#552,group={0},f1=COUNT(),f2=SUM($2),f3=SUM($3),f4=SUM($4),f5=SUM($5),f6=SUM($6),f7=SUM($7)),
>  
> rel#551:LogicalProject.NONE(input=RelSubset#550,key=$0,f1=$1,f2=$2,f3=$3,f4=$4,f5=$5,f6=$6)]
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:232)
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.plan.volcano.VolcanoPlanner.findBestExp(VolcanoPlanner.java:637)
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.tools.Programs$RuleSetProgram.run(Programs.java:340)
>   at 
> org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLPlannerImpl.transform(ZetaSQLPlannerImpl.java:168)
>   at 
> org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.parseQuery(ZetaSQLQueryPlanner.java:99)
>   at 
> org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.parseQuery(ZetaSQLQueryPlanner.java:87)
>   at 
> org.apache.beam.sdk.extensions.sql.zetasql.ZetaSQLQueryPlanner.convertToBeamRel(ZetaSQLQueryPlanner.java:66)
>   at 
> org.apache.beam.sdk.extensions.sql.impl.BeamSqlEnv.parseQuery(BeamSqlEnv.java:104)
>   ... 39 more
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 7
>   at 
> org.apache.beam.repackaged.sql.com.google.common.collect.RegularImmutableList.get(RegularImmutableList.java:58)
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.rel.rules.AggregateProjectMergeRule.apply(AggregateProjectMergeRule.java:96)
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.rel.rules.AggregateProjectMergeRule.onMatch(AggregateProjectMergeRule.java:73)
>   at 
> org.apache.beam.repackaged.sql.org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch(VolcanoRuleCall.java:205)
>   ... 48 more
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects

2019-10-19 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang resolved BEAM-8433.

Fix Version/s: Not applicable
   Resolution: Fixed

> DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
> 
>
> Key: BEAM-8433
> URL: https://issues.apache.org/jira/browse/BEAM-8433
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: Not applicable
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8433) DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects

2019-10-19 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated BEAM-8433:
---
Status: Open  (was: Triage Needed)

> DataCatalogBigQueryIT runs for both Calcite and ZetaSQL dialects
> 
>
> Key: BEAM-8433
> URL: https://issues.apache.org/jira/browse/BEAM-8433
> Project: Beam
>  Issue Type: Improvement
>  Components: dsl-sql
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7190) Enable file system based token authentication for Samza portable runner

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7190?focusedWorklogId=330969&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330969
 ]

ASF GitHub Bot logged work on BEAM-7190:


Author: ASF GitHub Bot
Created on: 19/Oct/19 18:08
Start Date: 19/Oct/19 18:08
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on pull request #8597: [BEAM-7190] 
Enable file based token auth for samza portable runner
URL: https://github.com/apache/beam/pull/8597
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330969)
Time Spent: 1h 40m  (was: 1.5h)

> Enable file system based token authentication for Samza portable runner
> ---
>
> Key: BEAM-7190
> URL: https://issues.apache.org/jira/browse/BEAM-7190
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> For Samza and potentially other portable runners, there is a need to secure 
> the communication between the SDK worker and the runner. Currently the SSL/TLS 
> support in portability is only half done.
> However, after investigation we found that it's sufficient to 1) use the 
> loopback address and 2) enforce authentication; that way the communication is 
> both authenticated and secured.
> This ticket intends to track the implementation of the solution above. More 
> details can be found in the following PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7190) Enable file system based token authentication for Samza portable runner

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7190?focusedWorklogId=330968&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330968
 ]

ASF GitHub Bot logged work on BEAM-7190:


Author: ASF GitHub Bot
Created on: 19/Oct/19 18:08
Start Date: 19/Oct/19 18:08
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #8597: [BEAM-7190] Enable 
file based token auth for samza portable runner
URL: https://github.com/apache/beam/pull/8597#issuecomment-544182218
 
 
   This pull request has been closed due to lack of activity. If you think that 
is incorrect, or the pull request requires review, you can revive the PR at any 
time.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330968)
Time Spent: 1.5h  (was: 1h 20m)

> Enable file system based token authentication for Samza portable runner
> ---
>
> Key: BEAM-7190
> URL: https://issues.apache.org/jira/browse/BEAM-7190
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Hai Lu
>Assignee: Hai Lu
>Priority: Major
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> For Samza and potentially other portable runners, there is a need to secure 
> the communication between the SDK worker and the runner. Currently the SSL/TLS 
> support in portability is only half done.
> However, after investigation we found that it's sufficient to 1) use the 
> loopback address and 2) enforce authentication; that way the communication is 
> both authenticated and secured.
> This ticket intends to track the implementation of the solution above. More 
> details can be found in the following PR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-6732) Add "withResults()" for JdbcIO.write()

2019-10-19 Thread Ankit Bharti (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-6732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16955162#comment-16955162
 ] 

Ankit Bharti commented on BEAM-6732:


Hey there,

I am trying to use the above implementation with an unbounded PCollection (Pub/Sub 
data source) that writes to DB1 and DB2.

The DB2 write uses Wait.on the DB1 write's withResults() output, but unfortunately 
DB2 is not getting updated.

With a bounded PCollection it works fine. Any leads?

Thanks in advance.

 

> Add "withResults()" for JdbcIO.write()
> --
>
> Key: BEAM-6732
> URL: https://issues.apache.org/jira/browse/BEAM-6732
> Project: Beam
>  Issue Type: Improvement
>  Components: io-java-jdbc
>Reporter: Alexey Romanenko
>Assignee: Alexey Romanenko
>Priority: Minor
> Fix For: 2.13.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Sometimes it's needed to have a collection of write results after using 
> {{JdbcIO.write()}}.
> For this purpose, we need to add a new transform, e.g. {{.withResults()}}, 
> which will return the results of the write.
> More details are on this mailing list discussion:  
> https://lists.apache.org/thread.html/ad088cfe519b4ff8df218cd3f996ac41ee256d568c7278c38576cdd4@%3Cdev.beam.apache.org%3E
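
A minimal sketch of the {{withResults()}} / {{Wait.on}} pattern discussed in the comment above, with hypothetical connection strings, table names, and statement setters. One likely explanation for the unbounded case: {{Wait.on}} only releases the elements of a window once the corresponding window of the signal collection closes, so with an unbounded source in the default global window the second write may never fire unless the signal is windowed (or triggered) appropriately.

{code}
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.transforms.Wait;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;

PCollection<KV<Integer, String>> rows = ...;  // e.g. parsed Pub/Sub messages (placeholder)

// First write; withResults() yields a PCollection<Void> usable as a signal.
PCollection<Void> db1Done =
    rows.apply("Write to DB1",
        JdbcIO.<KV<Integer, String>>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://db1-host/db1"))
            .withStatement("INSERT INTO table1 VALUES (?, ?)")
            .withPreparedStatementSetter((element, stmt) -> {
              stmt.setInt(1, element.getKey());
              stmt.setString(2, element.getValue());
            })
            .withResults());

// Second write waits until the DB1 write's signal window closes. In streaming
// this requires the signal to be in a non-global window (or carry a suitable
// trigger); otherwise the wait never completes.
rows.apply(Wait.on(db1Done))
    .apply("Write to DB2",
        JdbcIO.<KV<Integer, String>>write()
            .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
                "org.postgresql.Driver", "jdbc:postgresql://db2-host/db2"))
            .withStatement("INSERT INTO table2 VALUES (?, ?)")
            .withPreparedStatementSetter((element, stmt) -> {
              stmt.setInt(1, element.getKey());
              stmt.setString(2, element.getValue());
            }));
{code}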



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-7847) Generate Python SDK docs using Python 3

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-7847?focusedWorklogId=330932&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330932
 ]

ASF GitHub Bot logged work on BEAM-7847:


Author: ASF GitHub Bot
Created on: 19/Oct/19 13:06
Start Date: 19/Oct/19 13:06
Worklog Time Spent: 10m 
  Work Description: lazylynx commented on issue #9804: [WIP][BEAM-7847] 
generate SDK docs with Python3
URL: https://github.com/apache/beam/pull/9804#issuecomment-544141785
 
 
   Run Python PreCommit
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330932)
Time Spent: 20m  (was: 10m)

> Generate Python SDK docs using Python 3 
> 
>
> Key: BEAM-7847
> URL: https://issues.apache.org/jira/browse/BEAM-7847
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-core
>Reporter: Valentyn Tymofieiev
>Assignee: yoshiki obata
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Currently the scripts/generate_pydoc.sh script fails on Python 3 with 
> "RuntimeError: empty_like method already has a docstring" errors:
> {noformat}
> pip install -e .[gcp,test]
> pip install Sphinx==1.6.5
> pip install sphinx_rtd_theme==0.2.4
> ./scripts/generate_pydoc.sh
> /home/valentyn/projects/beam/beam/beam/sdks/python/target/docs/source/apache_beam.testing.benchmarks.nexmark.queries.query0.rst:4:
>  WARNING: autodoc: failed to import module 
> 'apache_beam.testing.benchmarks.nexmark.queries.query0'; the following 
> exception was raised:
> Traceback (most recent call last):
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/sphinx/ext/autodoc.py",
>  line 658, in import_object
> __import__(self.modname)
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/__init__.py", 
> line 98, in 
> from apache_beam import io
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/io/__init__.py",
>  line 22, in 
> from apache_beam.io.avroio import *
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/io/avroio.py",
>  line 61, in 
> from apache_beam.io import filebasedsink
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/io/filebasedsink.py",
>  line 34, in 
> from apache_beam.io import iobase
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/io/iobase.py",
>  line 50, in 
> from apache_beam.transforms import core
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/transforms/__init__.py",
>  line 29, in 
> from apache_beam.transforms.util import *
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/transforms/util.py",
>  line 228, in 
> class _BatchSizeEstimator(object):
>   File 
> "/home/valentyn/projects/beam/beam/beam/sdks/python/apache_beam/transforms/util.py",
>  line 359, in _BatchSizeEstimator
> import numpy as np
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/numpy/__init__.py", 
> line 142, in 
> from . import core
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/numpy/core/__init__.py",
>  line 17, in 
> from . import multiarray
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/numpy/core/multiarray.py",
>  line 78, in 
> def empty_like(prototype, dtype=None, order=None, subok=None, shape=None):
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/numpy/core/overrides.py",
>  line 203, in decorator
> docs_from_dispatcher=docs_from_dispatcher)(implementation)
>   File 
> "/home/valentyn/tmp/venv/py3/lib/python3.6/site-packages/numpy/core/overrides.py",
>  line 159, in decorator
> add_docstring(implementation, dispatcher.__doc__)
> RuntimeError: empty_like method already has a docstring
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work logged] (BEAM-5775) Make the spark runner not serialize data unless spark is spilling to disk

2019-10-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-5775?focusedWorklogId=330906&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-330906
 ]

ASF GitHub Bot logged work on BEAM-5775:


Author: ASF GitHub Bot
Created on: 19/Oct/19 09:08
Start Date: 19/Oct/19 09:08
Worklog Time Spent: 10m 
  Work Description: stale[bot] commented on issue #8371: [BEAM-5775] Move 
(most) of the batch spark pipelines' transformations to using lazy 
serialization.
URL: https://github.com/apache/beam/pull/8371#issuecomment-544118190
 
 
   This pull request has been marked as stale due to 60 days of inactivity. It 
will be closed in 1 week if no further activity occurs. If you think that’s 
incorrect or this pull request requires a review, please simply write any 
comment. If closed, you can revive the PR at any time and @mention a reviewer 
or discuss it on the d...@beam.apache.org list. Thank you for your 
contributions.
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
---

Worklog Id: (was: 330906)
Time Spent: 12h  (was: 11h 50m)

> Make the spark runner not serialize data unless spark is spilling to disk
> -
>
> Key: BEAM-5775
> URL: https://issues.apache.org/jira/browse/BEAM-5775
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-spark
>Reporter: Mike Kaplinskiy
>Assignee: Mike Kaplinskiy
>Priority: Minor
> Fix For: 2.17.0
>
>  Time Spent: 12h
>  Remaining Estimate: 0h
>
> Currently, for storage level MEMORY_ONLY, Beam does not coder-ify the data. 
> This lets Spark keep the data in memory, avoiding the serialization round 
> trip. Unfortunately the logic is fairly coarse - as soon as you switch to 
> MEMORY_AND_DISK, Beam coder-ifies the data even though Spark might have chosen 
> to keep the data in memory, incurring the serialization overhead.
>  
> Ideally Beam would serialize the data lazily - as Spark chooses to spill to 
> disk. This would be a change in behavior when using Beam, but luckily Spark 
> has a solution for folks that want data serialized in memory - 
> MEMORY_AND_DISK_SER will keep the data serialized.
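
For reference, a minimal sketch of the workaround mentioned above: asking the Spark runner to keep cached data serialized via {{MEMORY_AND_DISK_SER}}. The {{storageLevel}} option is assumed to be exposed on SparkPipelineOptions; treat the exact option and class names as assumptions to verify against your Beam version.

{code}
import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class StorageLevelExample {
  public static void main(String[] args) {
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);
    // Keep cached data serialized in memory so spilling to disk does not change
    // semantics; with plain MEMORY_AND_DISK, Beam currently coder-ifies the data
    // up front, which is the overhead this issue aims to remove.
    options.setStorageLevel("MEMORY_AND_DISK_SER");

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}
{code}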



--
This message was sent by Atlassian Jira
(v8.3.4#803005)