Fwd: Hadoop Weekly #191

2016-10-23 Thread Josh Elser
Congrats, the 1.0.0-incubating release was picked up by Hadoop Weekly :)
-- Forwarded message --
From: "Hadoop Weekly" 
Date: Oct 23, 2016 19:21
Subject: Hadoop Weekly #191
To: 
Cc:

Hadoop Weekly
> Issue #191
> 23 October 2016
>
> This week's issue is short and sweet with a few technical posts, two
> interesting news articles, and several exciting releases (including Apache
> Kafka 0.10.1.0). With Spark Summit Europe this week, expect lots of great
> content in the next issue. And if you're attending, please send interesting
> slides/talks my way!
>
> Technical
> ===
>
> Cloudera's CDH has supported intra-node disk balancing since version 5.8.2
> (the feature is also part of the Apache Hadoop 3.0.0 alpha release). With
> it, a data node can rebalance data blocks across its disks using the `hdfs
> diskbalancer` command. This post describes how the tool works and shows how
> to run it.
>
> http://blog.cloudera.com/blog/2016/10/how-to-use-the-new-hdfs-intra-datanode-disk-balancer-in-apache-hadoop/
>
>
> This post demonstrates the capabilities of the spark.ml library by
> building a logistic regression model to predict malignancy of cases from
> the Wisconsin Diagnostic Breast Cancer data set. The example code covers
> parsing, exploring a dataset with built-in statistics, extracting features
> from the input dataset, training the model, and evaluating the model.
>
> https://www.mapr.com/blog/predicting-breast-cancer-using-apache-spark-machine-learning-logistic-regression
>
>
> The Amazon Big Data blog has a tutorial for running RStudio with sparklyr
> on EMR. Thanks to a bootstrap action, a cluster complete with RStudio
> running on the master can be launched with a single command.
>
> https://aws.amazon.com/blogs/big-data/running-sparklyr-rstudios-r-interface-to-spark-on-amazon-emr/
>
>
> The Databricks blog features a list of seven tips for debugging Apache
> Spark code on Databricks. Most of the suggestions, like "Scale up Spark
> jobs slowly for really large datasets" and "Examine the partitioning for
> your dataset," are generally applicable to all Spark users.
>
> https://databricks.com/blog/2016/10/18/7-tips-to-debug-apache-spark-code-faster-with-databricks.html
>
>
> News
> ===
>
> InfoQ has an interview with Yahoo VP of Engineering, Peter Cnudde. Topics
> covered include Hadoop, Spark adoption at Yahoo (mostly for in-memory
> computing, not for ETL), and Caffe-on-Spark for deep learning.
>
> https://www.infoq.com/articles/peter-cnudde-yahoo-big-data
>
>
> ZDNet contributor Tony Baer has read between the lines when it comes to
> recent benchmarks by Cloudera and Hortonworks. The takeaways are as
> follows: 1) "SQL's the gateway drug to Hadoop." 2) Cloudera is trying to
> challenge Amazon (in this case Redshift), and 3) Hortonworks (via Hive's
> Live Long and Prosper) has caught up on the investment Cloudera made in
> Impala.
>
> http://www.zdnet.com/article/sql-on-hadoop-benchmarks-get-serious/
>
>
> Releases
> ===
>
> Apache Kafka 0.10.1.0 was released this week. It contains improvements
> from over 500 pull requests and the implementation of 15 Kafka Improvement
> Proposals. The Confluent blog has the highlights of additions/improvements
> to Kafka Server (time-based indexes, replication quotas, and improved log
> compaction), improvements to Kafka client APIs (interactive queries for
> Kafka Streams, improved memory management, secure quotas, and more), and
> bug fixes.
>
> http://mail-archives.apache.org/mod_mbox/kafka-users/201610.mbox/%3CCAJL4t_oz9q4T9vn6Z-EBoazWJFyqHw4Y0L-PTowD%2BpFhcPv0VQ%40mail.gmail.com%3E
> http://www.confluent.io/blog/announcing-apache-kafka-0-10-1-0/
>
> Apache Fluo (incubating) recently had its first release since entering
> the incubator. Fluo is a tool for making "incremental updates to large data
> sets stored in Apache Accumulo" a la Google's Percolator.
>
> https://fluo.apache.org/release/fluo-1.0.0-incubating/
>
>
> Apache Flume 1.7.0 was released. It adds support for a `taildir` source
> and includes a number of improvements and bug fixes. Many of these are
> around Flume's integration with Apache Kafka.
>
> http://flume.apache.org/releases/1.7.0.html
>
>
> Apache NiFi 0.7.1 was released as a follow-up to July's 0.7.0 release
> (version 1.0.0 was also released recently, in August). This release adds a
> number of improvements and bug fixes.
>
> https://cwiki.apache.org/confluence/display/NIFI/Release+Notes#ReleaseNotes-Version0.7.1
>
>
> Apache Giraph 1.2.0 was released. Highlights of the release include a new
> blocks API, support for graphs that don't fit in memory, and the addition
> of a new set of default configuration options based on Facebook's
> experience with Giraph.
>
> https://blogs.apache.org/giraph/entry/giraph_1_2_0_release
>
>
> `deeplearning4j` is a deep learning implementation that integrates with
> Hadoop and Spark and supports GPUs. Version 0.6.0 was recently 

Re: [VOTE] Apache Fluo Recipes 1.0.0-incubating-rc1

2016-10-23 Thread Christopher
+1

Verified sigs and hashes
mvn verify passes
Source tarball matches git commit (except expected DEPENDENCIES file added
by maven-remote-resources-plugin)
Jar manifests contain specified git commit
Jar sources and javadocs exist
Confirmed LICENSE, NOTICE, DISCLAIMER in source tarball
Manually inspected all jar MANIFEST.MF, LICENSE, and NOTICE files and all
look good to me
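
For anyone repeating the hash and manifest checks above, here is a minimal sketch in Python rather than the exact procedure used for this vote. It assumes the artifact and its ".sha1" or ".md5" sidecar file sit in the same directory, and that the git commit appears somewhere in the jar's META-INF/MANIFEST.MF (the attribute name is not assumed). The function names are hypothetical. Checking the ".asc" signature still needs gpg and is not covered here.

```python
import hashlib
import zipfile
from pathlib import Path


def file_digest(path: Path, algorithm: str = "sha1") -> str:
    """Return the hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sidecar(artifact: Path, algorithm: str = "sha1") -> bool:
    """Compare an artifact against its '<name>.<algorithm>' sidecar file."""
    sidecar = artifact.with_name(artifact.name + "." + algorithm)
    # Some sidecars hold "<hash>  <filename>"; keep only the first token.
    expected = sidecar.read_text().split()[0].lower()
    return file_digest(artifact, algorithm) == expected


def manifest_mentions(jar: Path, commit: str) -> bool:
    """Report whether the jar's manifest contains the given commit id."""
    with zipfile.ZipFile(jar) as archive:
        text = archive.read("META-INF/MANIFEST.MF").decode("utf-8")
    # Long manifest values wrap onto continuation lines that start with
    # a single space, so undo the wrapping before searching.
    unwrapped = text.replace("\r\n ", "").replace("\n ", "")
    return commit in unwrapped
```

Calling matches_sidecar(Path("fluo-recipes-1.0.0-incubating-source-release.tar.gz"), "sha1") would check the source tarball against its ".sha1" file, assuming both were downloaded from the staging repo into the current directory.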


On Sat, Oct 22, 2016 at 12:39 AM Christopher  wrote:

> What makes you think that jsr305 is not compatibly licensed? I spent some
> time investigating this and the following is what I found. Unless I've
> missed something, it looks like there's no issue with jsr305 as a
> dependency.
>
> * It looks to me like it's licensed under BSD. This is according to the
> findbugs project[1], which has been redistributing the artifact after it
> effectively went dormant[2]. The Google Groups set up for developing jsr305
> seems to confirm the developers had agreed to distribute it under this[3].
> * It looks like jsr305 is often incorrectly uploaded to Maven Central (by
> findbugs?) under AL2, which is the license in the POM for our dependency
> (version 3.0.0) [4]. It was once uploaded (again, seemingly incorrectly) as
> LGPL, but we're not using that version [5].
> * There is an outstanding GitHub issue for findbugs to clarify the
> license[6], because it looks like they've been mislabeling it when they
> redistribute. But, it's also possible that they've been able to relicense
> under AL2, and forgot to update their docs which still say it's BSD.
> * jsr305 is used by us during the build, as a test dependency. It looks
> like that's okay, since we're not bundling it[7].
> * It is also used as a compile and/or runtime transitive dependency via
> Apache Spark. Even if we did depend on it directly, it seems like it should
> be fine because it's an optional part of the project[8], as long as we're
> not bundling it, and we're not.
> * Is it a problem for Apache Spark to depend on this directly? If it's
> not, I can't imagine it would be for us to depend on it transitively,
> through them.
>
> [1]:
> https://github.com/findbugsproject/findbugs/blob/3.0.1/findbugs/licenses/LICENSE-jsr305.txt
> [2]: https://jcp.org/en/jsr/detail?id=305
> [3]: https://groups.google.com/forum/#!topic/jsr-305/gQWGmiWMjE8
> [4]:
> https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
> [5]:
> https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.8/jsr305-1.3.8.pom
> [6]: https://github.com/findbugsproject/findbugs/issues/128
> [7]: http://www.apache.org/legal/resolved.html#prohibited
> [8]: http://www.apache.org/legal/resolved.html#optional
>
>
> On Fri, Oct 21, 2016 at 6:37 PM Josh Elser  wrote:
>
> +1
>
> * Sigs/xsums OK
> * No binaries in release
> * KEYS is accurate
> * Can build from source
> * Direct dependencies OK (beware that you are transitively bringing in
> com.google.code.findbugs:jsr305:jar:3.0.0 which is not compatibly
> licensed -- this should be fixed in the future)
> * No Copyright notices
> * apache-rat:check passes
> * Can run all tests
> * Artifacts built from release appear to be appropriately licensed.
> * Commit is contained in repository
> * Would prefer to see apache-fluo-recipes as the name instead.
>
> - Josh
>
> Keith Turner wrote:
> > Fluo Developers,
> >
> > Please consider the following candidate for Fluo Recipes 1.0.0-incubating.
> >
> > Git Commit:
> >  682eff983f1fe6e60b75c36d3b2f782c6a93b155
> > Branch:
> >  1.0.0-incubating-rc1
> >
> > If this vote passes, a gpg-signed tag will be created using:
> >  git tag -f -m 'Apache Fluo Recipes 1.0.0-incubating' -s
> > rel/fluo-recipes-1.0.0-incubating \
> >  682eff983f1fe6e60b75c36d3b2f782c6a93b155
> > Staging repo:
> > https://repository.apache.org/content/repositories/orgapachefluo-1016
> > Source (official release artifact):
> > https://repository.apache.org/content/repositories/orgapachefluo-1016/org/apache/fluo/fluo-recipes/1.0.0-incubating/fluo-recipes-1.0.0-incubating-source-release.tar.gz
> > (Append ".sha1", ".md5", or ".asc" to download the signature/hash for a
> > given artifact.)
> >
> > All artifacts were built and staged with:
> >  mvn release:prepare && mvn release:perform
> >
> > Signing keys are available at
> > https://www.apache.org/dist/incubator/fluo/KEYS
> > (Expected fingerprint: CF72CA07C8BC86A1C862765F9AACFB56352ACF76)
> >
> > Release notes (in progress) can be found at:
> > https://fluo.apache.org/.../1.0.0-incubating
> >
> > Please vote one of:
> > [ ] +1 - I have verified and accept...
> > [ ] +0 - I have reservations, but not strong enough to vote against...
> > [ ] -1 - Because..., I do not accept...
> > ... these artifacts as the 1.0.0-incubating release of Apache Fluo Recipes.
> >
> > This vote will end on Sun Oct 23 22:30:00 UTC 2016
> > (Sun Oct 23 18:30:00 EDT 2016 / Sun Oct 23 15:30:00 PDT 2016)
> >
> > Thanks!
> >
> > P.S. Hint: download the whole 

Re: On Findbugs jsr305 (was Re: [VOTE] Apache Fluo Recipes 1.0.0-incubating-rc1)

2016-10-23 Thread Christopher
Aside from the understandable confusion of that issue's reporter about the
difference between the license of FindBugs itself and that of something
bundled with FindBugs, the evidence seems overwhelming that the artifact is
BSD:

- Original developers agreed to BSD on Google Groups forum
- Original repo (as archived on code.google.com) contains BSD LICENSE
- FindBugs repo where it's bundled declares it BSD
- RHEL/CentOS and Fedora RPMs all declare it BSD
- LICENSE in the current maintainer's repo is BSD (relocated from
code.google.com)
- The current maintainer has acknowledged the incorrect AL2 license in the
Maven Central artifact and is willing to accept a PR to fix[1].

Even if it were LGPL, I still don't think there'd be an issue, because of
how we're using it as a build-dep and transitive-dep for an optional
feature.
If you still think there's an issue, let's please escalate to LEGAL for
resolution.

[1]: https://github.com/amaembo/jsr-305/issues/27


On Sun, Oct 23, 2016 at 2:47 PM Josh Elser  wrote:

> The ambiguity of the conversation you provided in [6] is exactly why I
> have this opinion. Unless one of the devs can definitively say "it is
> BSD", there's way too much misinformation for me to feel comfortable
> with it.
>
> Given the availability of
> https://stephenc.github.com/findbugs-annotations, it's a no-brainer to
> use that instead, IMO.
>
> Specifically to Fluo, I did not inspect its usage that closely. If it's
> only used at build time, then, as you point out, it's a non-issue.
>
> Christopher wrote:
> > What makes you think that jsr305 is not compatibly licensed? I spent some
> > time investigating this and the following is what I found. Unless I've
> > missed something, it looks like there's no issue with jsr305 as a
> > dependency.
> >
> > * It looks to me like it's licensed under BSD. This is according to the
> > findbugs project[1], which has been redistributing the artifact after it
> > effectively went dormant[2]. The Google Groups set up for developing jsr305
> > seems to confirm the developers had agreed to distribute it under this[3].
> > * It looks like jsr305 is often incorrectly uploaded to Maven Central (by
> > findbugs?) under AL2, which is the license in the POM for our dependency
> > (version 3.0.0) [4]. It was once uploaded (again, seemingly incorrectly) as
> > LGPL, but we're not using that version [5].
> > * There is an outstanding GitHub issue for findbugs to clarify the
> > license[6], because it looks like they've been mislabeling it when they
> > redistribute. But, it's also possible that they've been able to relicense
> > under AL2, and forgot to update their docs which still say it's BSD.
> > * jsr305 is used by us during the build, as a test dependency. It looks
> > like that's okay, since we're not bundling it[7].
> > * It is also used as a compile and/or runtime transitive dependency via
> > Apache Spark. Even if we did depend on it directly, it seems like it should
> > be fine because it's an optional part of the project[8], as long as we're
> > not bundling it, and we're not.
> > * Is it a problem for Apache Spark to depend on this directly? If it's
> > not, I can't imagine it would be for us to depend on it transitively,
> > through them.
> >
> > [1]:
> > https://github.com/findbugsproject/findbugs/blob/3.0.1/findbugs/licenses/LICENSE-jsr305.txt
> > [2]: https://jcp.org/en/jsr/detail?id=305
> > [3]: https://groups.google.com/forum/#!topic/jsr-305/gQWGmiWMjE8
> > [4]:
> > https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/3.0.0/jsr305-3.0.0.pom
> > [5]:
> > https://repo1.maven.org/maven2/com/google/code/findbugs/jsr305/1.3.8/jsr305-1.3.8.pom
> > [6]: https://github.com/findbugsproject/findbugs/issues/128
> > [7]: http://www.apache.org/legal/resolved.html#prohibited
> > [8]: http://www.apache.org/legal/resolved.html#optional
> >
> > On Fri, Oct 21, 2016 at 6:37 PM Josh Elser  wrote:
> >
> >> +1
> >>
> >> * Sigs/xsums OK
> >> * No binaries in release
> >> * KEYS is accurate
> >> * Can build from source
> >> * Direct dependencies OK (beware that you are transitively bringing in
> >> com.google.code.findbugs:jsr305:jar:3.0.0 which is not compatibly
> >> licensed -- this should be fixed in the future)
> >> * No Copyright notices
> >> * apache-rat:check passes
> >> * Can run all tests
> >> * Artifacts built from release appear to be appropriately licensed.
> >> * Commit is contained in repository
> >> * Would prefer to see apache-fluo-recipes as the name instead.
> >>
> >> - Josh
> >>
> >> Keith Turner wrote:
> >>> Fluo Developers,
> >>>
> >>> Please consider the following candidate for Fluo Recipes 1.0.0-incubating.
> >>>
> >>> Git Commit:
> >>>   682eff983f1fe6e60b75c36d3b2f782c6a93b155
> >>> Branch:
> >>>   1.0.0-incubating-rc1
> >>>
> >>> If this vote passes, a gpg-signed tag will be created using:
> >>>   git tag -f -m 'Apache Fluo Recipes 1.0.0-incubating' -s
> >>>