[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)

2016-10-17 Thread Marius Van Niekerk (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583801#comment-15583801
 ] 

Marius Van Niekerk commented on TOREE-349:
--

So do you need to do the same thing for spark-shell?

> ClassCastException when reading Avro from another thread (Toree master / 
> Spark 2.0.0)
> -
>
> Key: TOREE-349
> URL: https://issues.apache.org/jira/browse/TOREE-349
> Project: TOREE
> Issue Type: Bug
> Reporter: Andrew Kerr
> Attachments: avro-csv-threading.scala.ipynb, run.sh
>
>
> When using Toree (master branch, commit
> e8ecd0623c65ad104045b1797fb27f69b8dfc23f) with
> `--packages=com.databricks:spark-avro_2.11:3.0.1` in `SPARK_OPTS`, loading an
> Avro file into a DataFrame *in a separate thread* throws
> `java.lang.ClassCastException:
> com.databricks.spark.avro.DefaultSource$SerializableConfiguration cannot be
> cast to com.databricks.spark.avro.DefaultSource$SerializableConfiguration`
> here:
> https://github.com/databricks/spark-avro/blob/v3.0.1/src/main/scala/com/databricks/spark/avro/DefaultSource.scala#L156
> I will attach a Jupyter notebook that illustrates the problem and includes the
> full stack trace, along with a script showing the environment.
> `DefaultSource`, the class that throws the exception, broadcasts the Hadoop
> config and returns an anonymous function that accesses that config. The
> exception occurs when that function is executed and attempts to access the
> config.
> This looks to me like a class loader mismatch ("Class Identity Crisis").
> With a bit of hacking on `spark-avro`, I've seen that the class loader for
> `DefaultSource` is
> `scala.reflect.internal.util.ScalaClassLoader$URLClassLoader@31ac5411`
> when the config is broadcast, and
> `org.apache.spark.util.MutableURLClassLoader@3d3fcdb0`
> when the config is read.
> If a fat jar including `spark-avro` is built and included with `--jars=...`,
> the same problem occurs.
> Interestingly, Spark's built-in CSV support uses the same pattern as Avro
> (broadcasting a config) but works as expected, as shown in the notebook:
> https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L108
> Avro also works as expected when an application fat jar is built and passed
> to `spark-submit` without involving Toree.
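
The class loader comparison described in the issue can be reproduced without
patching `spark-avro`. The following is only a minimal sketch using the
standard JVM API; `ClassLoaderDiagnostics` and its method names are
hypothetical and not part of Toree, Spark, or spark-avro. It reports which
loader defined a class and which context loader the calling thread carries, so
the two call sites (broadcast and read) can be compared.

```scala
object ClassLoaderDiagnostics {
  /** Lists `start` and its parents; the bootstrap loader (null) is not included. */
  def loaderChain(start: ClassLoader): List[ClassLoader] =
    Iterator.iterate(start)(_.getParent).takeWhile(_ != null).toList

  /** The JVM treats two classes as the same only if name AND defining loader match. */
  def sameRuntimeClass(a: Class[_], b: Class[_]): Boolean =
    a.getName == b.getName && (a.getClassLoader eq b.getClassLoader)

  /** One-line summary of a class's defining loader and the current thread's context loader. */
  def describe(cls: Class[_]): String = {
    val defining = Option(cls.getClassLoader).map(_.toString).getOrElse("<bootstrap>")
    val context = Option(Thread.currentThread().getContextClassLoader)
      .map(_.toString).getOrElse("<none>")
    s"${cls.getName} defined by $defining; thread context loader is $context"
  }
}
```

Calling `ClassLoaderDiagnostics.describe(value.getClass)` on the notebook
thread and again inside the other thread would show whether the same class
name is being resolved through two different loaders, which is the condition
under which a cast between identically named classes fails.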



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TOREE-350) Address comment on 0.1.0 rc2

2016-10-17 Thread Gino Bustelo (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582357#comment-15582357
 ] 

Gino Bustelo commented on TOREE-350:


Comments from Hitesh:
   - no KEYS file so no way to verify the gpg signature ( unless one downloads
     an unverified key from a public server )
   - bin tarball should untar into an apache-toree-incubating or apache-toree
     directory, source tarball untars into ./ - should be fixed to use a
     top-level dir IMO.
   - Something to check on - are the copyrights needed in the NOTICE file or
     the LICENSE file? I am not too sure if they are needed in the NOTICE file.
   - licenses/LICENSE-jline.txt seems to have some html but not the actual
     license content. Did not look at all the files so folks should re-check
     those.
   - Most projects tend to have one license file per license type and not a
     license file per dependency - with the copyrights called out in the main
     LICENSE file I believe.
   - source tarball seems to have too many licenses. Unless jline, scala, asm,
     ammonite, etc. are bundled into the source tarball, they do not need to be
     called out in the LICENSE and/or NOTICE file.
   - bunch of markdown files without a license header
   - there are a bunch of test jars checked into the source. Is there ALv2
     provenance for all of them (including the sparkr tarball )?

> Address comment on 0.1.0 rc2
> 
>
> Key: TOREE-350
> URL: https://issues.apache.org/jira/browse/TOREE-350
> Project: TOREE
> Issue Type: Bug
> Reporter: Gino Bustelo
> Fix For: 0.1.0
>
>






Re: [VOTE] Apache Toree 0.1.0 RC2

2016-10-17 Thread Gino Bustelo
Actually... on the license thing, we also followed Spark's lead
https://github.com/apache/spark/tree/cff560755244dd4ccb998e0c56e81d2620cd4cff/licenses


Re: [VOTE] Apache Toree 0.1.0 RC2

2016-10-17 Thread Gino Bustelo
I'm working on changes in https://github.com/apache/incubator-toree/pull/75.
I've addressed most of the issues above.

> - no KEYS file so no way to verify the gpg signature ( unless one
> downloads an unverified key from a public server )
Where do I put this KEYS file?

> - Something to check on - are the copyrights needed in the NOTICE file or
> the LICENSE file? I am not too sure if they are needed in the NOTICE file.
Who can answer this? Is this a stop-ship problem?

> - Most projects tend to have one license file per license type and not a
> license file per dependency - with the copyrights called out in the main
> LICENSE file I believe.
We did the initial work to prep for release a while back, so I can't
remember which project we modeled it on. Is this a stop-ship problem?

> - bunch of markdown files without a license header
Do we have to add headers to these? The markdown files in Spark (e.g.
https://raw.githubusercontent.com/apache/spark/cff560755244dd4ccb998e0c56e81d2620cd4cff/docs/quick-start.md)
have no headers.

> - there are a bunch of test jars checked into the source. Is there ALv2
> provenance for all of them (including the sparkr tarball )?
The test jars are used for testing Magic downloading. They are ours.


On Thu, Oct 13, 2016 at 6:00 PM Hitesh Shah wrote:

> md5/sha sigs look fine. pgp sig also looked good though see below.
>
> Comments:
>    - no KEYS file so no way to verify the gpg signature ( unless one
>      downloads an unverified key from a public server )
>    - bin tarball should untar into an apache-toree-incubating or
>      apache-toree directory, source tarball untars into ./ - should be fixed
>      to use a top-level dir IMO.
>    - Something to check on - are the copyrights needed in the NOTICE file
>      or the LICENSE file? I am not too sure if they are needed in the NOTICE
>      file.
>    - licenses/LICENSE-jline.txt seems to have some html but not the actual
>      license content. Did not look at all the files so folks should re-check
>      those.
>    - Most projects tend to have one license file per license type and not
>      a license file per dependency - with the copyrights called out in the
>      main LICENSE file I believe.
>    - source tarball seems to have too many licenses. Unless jline, scala,
>      asm, ammonite, etc. are bundled into the source tarball, they do not
>      need to be called out in the LICENSE and/or NOTICE file.
>    - bunch of markdown files without a license header
>    - there are a bunch of test jars checked into the source. Is there ALv2
>      provenance for all of them (including the sparkr tarball )?
>
> Vote thread has a bahir related typo.
>
> thanks
> — Hitesh
>
>
> > On Oct 11, 2016, at 12:16 PM, Gino Bustelo wrote:
> >
> > Please vote to approve the release of the following candidate as
> > Apache Toree version 0.1.0
> >
> > The commit to be voted on is 119bf3e2d1d16986f55802cf2323e8629ea25ef8
> > <https://github.com/apache/incubator-toree/tree/119bf3e2d1d16986f55802cf2323e8629ea25ef8>
> >
> > https://github.com/apache/incubator-toree/tree/119bf3e2d1d16986f55802cf2323e8629ea25ef8
> > <https://github.com/apache/bahir/tree/368c436ae2ad34b3ca64d11801aee69e478555f7>
> >
> > All distribution packages, including signatures, digests, etc. can be
> > found at:
> >
> > https://dist.apache.org/repos/dist/dev/incubator/toree/0.1.0/rc2/
> >
> > The vote is open for at least 72 hours and passes if a majority of at
> > least 3 +1 PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Toree 0.1.0
> > [ ] -1 Do not release this package because ...
>
>


[jira] [Commented] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)

2016-10-17 Thread Andrew Kerr (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582769#comment-15582769
 ] 

Andrew Kerr commented on TOREE-349:
---

This code works as expected:

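```
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global // assumed execution context
import scala.concurrent.duration.Duration
import com.databricks.spark.avro._ // provides `.avro` on the reader

// Capture the context class loader of the notebook thread.
val classLoader = Thread.currentThread().getContextClassLoader
println(classLoader)
val future = Future {
  // Install the captured loader on the worker thread before reading.
  Thread.currentThread().setContextClassLoader(classLoader)
  session.read.avro("foo")
}
val result = Await.result(future, Duration.Inf)
result.show()
```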

The classloader is 
`scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@864ff30`

Obviously this isn't ideal. It also isn't necessary for loading CSV files,
even though the CSV loader is implemented in a similar way to the Avro loader
(the Avro code looks copy-pasted from the CSV code).
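
The workaround above can be wrapped so that the captured loader is also
restored afterwards, which avoids leaving a pool thread with a changed context
class loader. This is only a sketch built on the standard library;
`ContextClassLoaderFutures`, `withContextClassLoader` and
`futureWithCallerLoader` are hypothetical names, not Toree or Spark API, and
the sketch does not address the underlying loader mismatch.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ContextClassLoaderFutures {
  /** Runs `body` with `loader` as the context class loader, then restores the previous one. */
  def withContextClassLoader[A](loader: ClassLoader)(body: => A): A = {
    val thread = Thread.currentThread()
    val previous = thread.getContextClassLoader
    thread.setContextClassLoader(loader)
    try body
    finally thread.setContextClassLoader(previous)
  }

  /** Captures the caller's context class loader now and installs it around `body` on the pool thread. */
  def futureWithCallerLoader[A](body: => A)(implicit ec: ExecutionContext): Future[A] = {
    // Read on the calling (notebook) thread, before the Future is scheduled.
    val callerLoader = Thread.currentThread().getContextClassLoader
    Future(withContextClassLoader(callerLoader)(body))
  }
}
```

With this helper, the read in the snippet above would become
`ContextClassLoaderFutures.futureWithCallerLoader { session.read.avro("foo") }`,
assuming an implicit `ExecutionContext` is in scope.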



[jira] [Comment Edited] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)

2016-10-17 Thread Andrew Kerr (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582769#comment-15582769
 ] 

Andrew Kerr edited comment on TOREE-349 at 10/17/16 5:04 PM:
-

This code works as expected:

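{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global // assumed execution context
import scala.concurrent.duration.Duration
import com.databricks.spark.avro._ // provides `.avro` on the reader

// Capture the context class loader of the notebook thread.
val classLoader = Thread.currentThread().getContextClassLoader
println(classLoader)
val future = Future {
  // Install the captured loader on the worker thread before reading.
  Thread.currentThread().setContextClassLoader(classLoader)
  session.read.avro("foo")
}
val result = Await.result(future, Duration.Inf)
result.show()
{code}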

The classloader is 
scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@864ff30

Obviously this isn't ideal. It also isn't necessary for loading CSV files,
even though the CSV loader is implemented in a similar way to the Avro loader
(the Avro code looks copy-pasted from the CSV code).




[jira] [Comment Edited] (TOREE-349) ClassCastException when reading Avro from another thread (Toree master / Spark 2.0.0)

2016-10-17 Thread Andrew Kerr (JIRA)

[ 
https://issues.apache.org/jira/browse/TOREE-349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15582769#comment-15582769
 ] 

Andrew Kerr edited comment on TOREE-349 at 10/17/16 5:06 PM:
-

This code works as expected:

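{code:language=scala}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global // assumed execution context
import scala.concurrent.duration.Duration
import com.databricks.spark.avro._ // provides `.avro` on the reader

// Capture the context class loader of the notebook thread.
val classLoader = Thread.currentThread().getContextClassLoader
println(classLoader)
val future = Future {
  // Install the captured loader on the worker thread before reading.
  Thread.currentThread().setContextClassLoader(classLoader)
  session.read.avro("foo")
}
val result = Await.result(future, Duration.Inf)
result.show()
{code}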

The classloader is 
scala.tools.nsc.interpreter.IMain$TranslatingClassLoader@864ff30

Obviously this isn't ideal. It also isn't necessary for loading CSV files,
even though the CSV loader is implemented in a similar way to the Avro loader
(the Avro code looks copy-pasted from the CSV code).

