Re: Broadcast big dataset

2016-09-28 Thread WangJianfei
First thank you very much!
  My executor memory is also 4G, but my Spark version is 1.5. Could the Spark
version be causing the problem?




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127p19143.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE] Release Apache Spark 2.0.1 (RC4)

2016-09-28 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version
2.0.1. The vote is open until Sat, Oct 1, 2016 at 20:00 PDT and passes if a
majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 2.0.1
[ ] -1 Do not release this package because ...


The tag to be voted on is v2.0.1-rc4
(933d2c1ea4e5f5c4ec8d375b5ccaa4577ba4be38)

This release candidate resolves 301 issues:
https://s.apache.org/spark-2.0.1-jira

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-bin/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1203/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.1-rc4-docs/


Q: How can I help test this release?
A: If you are a Spark user, you can help us test this release by taking an
existing Spark workload and running on this release candidate, then
reporting any regressions from 2.0.0.

Q: What justifies a -1 vote for this release?
A: This is a maintenance release in the 2.0.x series.  Bugs already present
in 2.0.0, missing features, or bugs related to new features will not
necessarily block this release.

Q: What fix version should I use for patches merging into branch-2.0 from
now on?
A: Please mark the fix version as 2.0.2, rather than 2.0.1. If a new RC
(i.e. RC5) is cut, I will change the fix version of those patches to 2.0.1.
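
For anyone testing against the staging repository rather than the convenience
binaries, here is a minimal sbt sketch; the staging URL is the one listed
above, while the resolver name and artifact coordinates are assumptions about
a typical test build:

// build.sbt -- pull the staged 2.0.1 artifacts for smoke-testing an existing workload
resolvers += "apache-spark-2.0.1-rc4-staging" at "https://repository.apache.org/content/repositories/orgapachespark-1203/"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.1",
  "org.apache.spark" %% "spark-sql"  % "2.0.1"
)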


Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Michael Gummelt
+1

I know this is cancelled, but FYI, RC3 passes mesos/spark integration tests

On Wed, Sep 28, 2016 at 2:52 AM, Sean Owen  wrote:

> (Process-wise there's no problem with that. The vote is open for at
> least 3 days and ends when the RM says it ends. So it's valid anyway
> as the vote is still open.)
>
> On Tue, Sep 27, 2016 at 8:37 PM, Reynold Xin  wrote:
> > So technically the vote has passed, but IMHO it does not make sense to
> > release this and then immediately release 2.0.2. I will work on a new RC
> > once SPARK-17666 and SPARK-17673 are fixed.
> >
> > Please shout if you disagree.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


-- 
Michael Gummelt
Software Engineer
Mesosphere


Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Jakob Odersky
I agree with Sean's answer; you can check out the relevant serializer
here:
https://github.com/twitter/chill/blob/develop/chill-scala/src/main/scala/com/twitter/chill/Traversable.scala
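
As a concrete illustration of the client-side workaround Sean suggests below,
here is a minimal sketch: ship a plain Map and re-apply the default at the
point of use instead of serializing the Map.WithDefault wrapper. The object
and app names are made up for the example.

import org.apache.spark.{SparkConf, SparkContext}

object DefaultMapWorkaround {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf()
        .setAppName("default-map-workaround")
        .setMaster("local[*]") // assumption: local mode, only for the demo
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

    val aMap = Map[String, Long]().withDefaultValue(0L)

    // Don't ship the Map.WithDefault wrapper; ship the underlying plain map
    // and supply the default on the executor side.
    val plain: Map[String, Long] = aMap.toSeq.toMap
    val first = sc.parallelize(Seq(plain))
      .map(m => m.getOrElse("a", 0L)) // or: m.withDefaultValue(0L).apply("a")
      .first()

    println(first) // 0
    sc.stop()
  }
}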

On Wed, Sep 28, 2016 at 3:11 AM, Sean Owen  wrote:
> My guess is that Kryo specially handles Maps generically or relies on
> some mechanism that does, and it happens to iterate over all
> key/values as part of that and of course there aren't actually any
> key/values in the map. The Java serialization is a much more literal
> (expensive) field-by-field serialization which works here because
> there's no special treatment. I think you could register a custom
> serializer that handles this case. Or work around it in your client
> code. I know there have been other issues with Kryo and Map because,
> for example, sometimes a Map in an application is actually some
> non-serializable wrapper view.
>
> On Wed, Sep 28, 2016 at 3:18 AM, Maciej Szymkiewicz
>  wrote:
>> Hi everyone,
>>
>> I suspect there is no point in submitting a JIRA to fix this (not a Spark
>> issue?) but I would like to know if this problem is documented anywhere.
>> Somehow Kryo is losing the default value during serialization:
>>
>> scala> import org.apache.spark.{SparkContext, SparkConf}
>> import org.apache.spark.{SparkContext, SparkConf}
>>
>> scala> val aMap = Map[String, Long]().withDefaultValue(0L)
>> aMap: scala.collection.immutable.Map[String,Long] = Map()
>>
>> scala> aMap("a")
>> res6: Long = 0
>>
>> scala> val sc = new SparkContext(new
>> SparkConf().setAppName("bar").set("spark.serializer",
>> "org.apache.spark.serializer.KryoSerializer"))
>>
>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>> 16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
>> java.util.NoSuchElementException: key not found: a
>>
>> while Java serializer works just fine:
>>
>> scala> val sc = new SparkContext(new
>> SparkConf().setAppName("bar").set("spark.serializer",
>> "org.apache.spark.serializer.JavaSerializer"))
>>
>> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
>> res9: Long = 0
>>
>> --
>> Best regards,
>> Maciej
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Joseph Bradley
+1 for 4 months.  With QA taking about a month, that's very reasonable.

My main ask (especially for MLlib) is for contributors and committers to
take extra care not to delay on updating the Programming Guide for new
APIs.  Documentation debt often collects and has to be paid off during QA,
and a longer cycle will exacerbate this problem.

On Wed, Sep 28, 2016 at 7:30 AM, Tom Graves 
wrote:

> +1 to 4 months.
>
> Tom
>
>
> On Tuesday, September 27, 2016 2:07 PM, Reynold Xin 
> wrote:
>
>
> We are 2 months past releasing Spark 2.0.0, an important milestone for the
> project. Spark 2.0.0 deviated (it took 6 months) from the regular release
> cadence we had for the 1.x line, and we never explicitly discussed what the
> release cadence should look like for 2.x. Thus this email.
>
> During Spark 1.x, roughly every three months we made a new 1.x feature
> release (e.g. 1.5.0 came out three months after 1.4.0). Development
> happened primarily in the first two months, and then a release branch was
> cut at the end of month 2, and the last month was reserved for QA and
> release preparation.
>
> During 2.0.0 development, I really enjoyed the longer release cycle
> because there were a lot of major changes happening and the longer time was
> critical for thinking through architectural changes as well as API design.
> While I don't expect the same degree of drastic changes in a 2.x feature
> release, I do think it'd make sense to increase the length of release cycle
> so we can make better designs.
>
> My strawman proposal is to maintain a regular release cadence, as we did
> in Spark 1.x, and increase the cycle from 3 months to 4 months. This
> effectively gives us ~50% more time to develop (in reality it'd be slightly
> less than 50% since longer dev time also means longer QA time). As for
> maintenance releases, I think those should still be cut on-demand, similar
> to Spark 1.x, but more aggressively.
>
> To put this into perspective, 4-month cycle means we will release Spark
> 2.1.0 at the end of Nov or early Dec (and branch cut / code freeze at the
> end of Oct).
>
> I am curious what others think.
>
>
>
>
>


Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Burak, you can configure what happens with corrupt records for the
datasource using the parse mode.  The parse will still fail, so we can't
get any data out of it, but we do leave the JSON in another column for you
to inspect.

In the case of this function, we'll just return null if it's unparsable. You
could filter for rows where the function returns null and inspect the input
if you want to see what's going wrong.

When you talk about ‘user specified schema’ do you mean for the user to
> supply an additional schema, or that you’re using the schema that’s
> described by the JSON string?


I mean we don't do schema inference (which we might consider adding, but
that would be a much larger change than this PR).  You need to construct a
StructType that says what columns you want to extract from the JSON column
and pass that in.  I imagine in many cases the user will run schema
inference ahead of time and then encode the inferred schema into their
program.
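
To make this concrete, here is a hedged sketch of what calling the function
could look like; the from_json name and the StructType-based schema come from
the PR description, while the exact signature, column names, and data are
illustrative assumptions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object FromJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("from-json-sketch")
      .master("local[*]") // assumption: local mode, only for the demo
      .getOrCreate()
    import spark.implicits._

    // JSON is just one column amongst others, e.g. a Kafka value next to a key.
    val df = Seq(
      (1, """{"device":"a","temp":21.5}"""),
      (2, "not json at all") // unparsable row -> from_json yields null
    ).toDF("id", "json")

    // The user-specified schema: the fields to extract from the JSON column.
    val schema = new StructType()
      .add("device", StringType)
      .add("temp", DoubleType)

    val parsed = df.select($"id", from_json($"json", schema).as("parsed"))
    parsed.show(false)

    // Rows that failed to parse can be found by filtering for null.
    parsed.filter($"parsed".isNull).show(false)

    spark.stop()
  }
}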


On Wed, Sep 28, 2016 at 11:04 AM, Burak Yavuz  wrote:

> I would really love something like this! It would be great if it doesn't
> throw away corrupt_records like the Data Source.
>
> On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
> wrote:
>
>> We are currently pulling out the JSON columns, passing them through
>> read.json, and then joining them back onto the initial DF so something like
>> from_json would be a nice quality of life improvement for us.
>>
>> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust <
>> mich...@databricks.com> wrote:
>>
>>> Spark SQL has great support for reading text files that contain JSON
>>> data. However, in many cases the JSON data is just one column amongst
>>> others. This is particularly true when reading from sources such as Kafka.
>>> This PR adds a new function from_json that converts a string column into
>>> a nested StructType with a user-specified schema, using the same internal
>>> logic as the json Data Source.
>>>
>>> Would love to hear any comments / suggestions.
>>>
>>> Michael
>>>
>>
>>
>


Re: Broadcast big dataset

2016-09-28 Thread Andrew Duffy
Have you tried upping executor memory? There's a separate spark conf for that: 
spark.executor.memory
In general driver configurations don't automatically apply to executors.
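
A small sketch of that distinction; the values are placeholders, and note that
in client mode spark.driver.memory only takes effect if set before the driver
JVM starts (e.g. on the spark-submit command line or in spark-defaults.conf):

import org.apache.spark.{SparkConf, SparkContext}

// Executor heap is configured separately from the driver heap.
val conf = new SparkConf()
  .setAppName("broadcast-big-dataset")
  .set("spark.executor.memory", "8g") // executor heap; not derived from the driver setting
val sc = new SparkContext(conf)

// Or, equivalently, at submission time (illustrative values):
//   spark-submit --driver-memory 4g --executor-memory 8g --class Main app.jar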





On Wed, Sep 28, 2016 at 7:03 AM -0700, "WangJianfei" wrote:










Hi Devs
 In my application, I just broadcast a dataset (about 500 MB) to the
executors (100+), and I got a Java heap error:
Jmartad-7219.hadoop.jd.local:53591 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:48 INFO BlockManagerInfo: Added broadcast_9_piece19 in memory
on BJHC-Jmartad-9012.hadoop.jd.local:53197 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:49 INFO BlockManagerInfo: Added broadcast_9_piece8 in memory
on BJHC-Jmartad-84101.hadoop.jd.local:52044 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:58 INFO BlockManagerInfo: Removed broadcast_8_piece0 on
172.22.176.114:37438 in memory (size: 2.7 KB, free: 3.1 GB)
16/09/28 15:56:58 WARN TaskSetManager: Lost task 125.0 in stage 7.0 (TID
130, BJHC-Jmartad-9376.hadoop.jd.local): java.lang.OutOfMemoryError: Java
heap space
at 
java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3465)
at
java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3271)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)

My configuration is 4G of memory for the driver. Any advice is appreciated.
Thank you!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org








Re: Spark SQL JSON Column Support

2016-09-28 Thread Michael Segel
Silly question?
When you talk about ‘user specified schema’ do you mean for the user to supply 
an additional schema, or that you’re using the schema that’s described by the 
JSON string?
(or both? [either/or] )

Thx

On Sep 28, 2016, at 12:52 PM, Michael Armbrust wrote:

Spark SQL has great support for reading text files that contain JSON data.
However, in many cases the JSON data is just one column amongst others. This is
particularly true when reading from sources such as Kafka. This PR adds a new
function from_json that converts a string column into a nested StructType with
a user-specified schema, using the same internal logic as the json Data Source.

Would love to hear any comments / suggestions.

Michael



Re: Spark SQL JSON Column Support

2016-09-28 Thread Burak Yavuz
I would really love something like this! It would be great if it doesn't
throw away corrupt_records like the Data Source.

On Wed, Sep 28, 2016 at 11:02 AM, Nathan Lande 
wrote:

> We are currently pulling out the JSON columns, passing them through
> read.json, and then joining them back onto the initial DF so something like
> from_json would be a nice quality of life improvement for us.
>
> On Wed, Sep 28, 2016 at 10:52 AM, Michael Armbrust wrote:
>
>> Spark SQL has great support for reading text files that contain JSON
>> data. However, in many cases the JSON data is just one column amongst
>> others. This is particularly true when reading from sources such as Kafka.
>> This PR adds a new function from_json that converts a string column into a
>> nested StructType with a user-specified schema, using the same internal
>> logic as the json Data Source.
>>
>> Would love to hear any comments / suggestions.
>>
>> Michael
>>
>
>


Spark SQL JSON Column Support

2016-09-28 Thread Michael Armbrust
Spark SQL has great support for reading text files that contain JSON data.
However, in many cases the JSON data is just one column amongst others.
This is particularly true when reading from sources such as Kafka. This PR
adds a new function from_json that converts a string column into a nested
StructType with a user-specified schema, using the same internal logic as
the json Data Source.

Would love to hear any comments / suggestions.

Michael


Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Sean Owen
I guess I'm claiming the artifacts wouldn't even be different in the first
place, because the Hadoop APIs that are used are all the same across these
versions. That would be the thing that makes you need multiple versions of
the artifact under multiple classifiers.
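
For reference, a sketch of the "manage the version yourself" approach in sbt
form (the Maven equivalent is a dependencyManagement entry); the Hadoop
version and module list below are illustrative, not a recommendation:

// build.sbt -- override the transitive Hadoop version pulled in by Spark 2.0.0
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0",
  "org.apache.spark" %% "spark-sql"  % "2.0.0"
)

// Spark's use of the Hadoop APIs is binary-compatible across 2.2+, so pinning
// the hadoop-client your cluster actually runs (2.6.x here) is enough.
dependencyOverrides += "org.apache.hadoop" % "hadoop-client" % "2.6.4"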

On Wed, Sep 28, 2016 at 1:16 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> ok, don't you think it could be published with just different classifiers
> hadoop-2.6
> hadoop-2.4
> hadoop-2.2 being the current default.
>
> So for now, I should just override spark 2.0.0's dependencies with the
> ones defined in the pom profile
>
>
>
> On Thu, Sep 22, 2016 11:17 AM, Sean Owen so...@cloudera.com wrote:
>
>> There can be just one published version of the Spark artifacts and they
>> have to depend on something, though in truth they'd be binary-compatible
>> with anything 2.2+. So you merely manage the dependency versions up to the
>> desired version in your own build.
>>
>> On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>> Hi,
>> when we fetch Spark 2.0.0 as a Maven dependency we automatically end
>> up with Hadoop 2.2 as a transitive dependency. I know multiple profiles are
>> used to generate the different tar.gz bundles that we can download. Is
>> there by any chance a publication of Spark 2.0.0 with different classifiers
>> according to different versions of Hadoop available?
>>
>> Thanks for your time !
>>
>> *Olivier Girardot*
>>
>>
>>
>
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>


Re: Using Spark as a Maven dependency but with Hadoop 2.6

2016-09-28 Thread Olivier Girardot
ok, don't you think it could be published with just different classifiers
hadoop-2.6
hadoop-2.4
hadoop-2.2 being the current default.
So for now, I should just override spark 2.0.0's dependencies with the ones
defined in the pom profile
 





On Thu, Sep 22, 2016 11:17 AM, Sean Owen so...@cloudera.com
wrote:
There can be just one published version of the Spark artifacts and they have to
depend on something, though in truth they'd be binary-compatible with anything
2.2+. So you merely manage the dependency versions up to the desired version in
your own build.
On Thu, Sep 22, 2016 at 7:05 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
Hi, when we fetch Spark 2.0.0 as a Maven dependency we automatically end up
with Hadoop 2.2 as a transitive dependency. I know multiple profiles are used to
generate the different tar.gz bundles that we can download. Is there by any
chance a publication of Spark 2.0.0 with different classifiers according to
different versions of Hadoop available?
Thanks for your time !
Olivier Girardot

 


Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94

Re: [discuss] Spark 2.x release cadence

2016-09-28 Thread Tom Graves
+1 to 4 months.
Tom 

On Tuesday, September 27, 2016 2:07 PM, Reynold Xin  
wrote:
 

 We are 2 months past releasing Spark 2.0.0, an important milestone for the
project. Spark 2.0.0 deviated (it took 6 months) from the regular release
cadence we had for the 1.x line, and we never explicitly discussed what the
release cadence should look like for 2.x. Thus this email.
During Spark 1.x, roughly every three months we made a new 1.x feature release
(e.g. 1.5.0 came out three months after 1.4.0). Development happened primarily
in the first two months, and then a release branch was cut at the end of month 
2, and the last month was reserved for QA and release preparation.
During 2.0.0 development, I really enjoyed the longer release cycle because 
there were a lot of major changes happening and the longer time was critical for
thinking through architectural changes as well as API design. While I don't 
expect the same degree of drastic changes in a 2.x feature release, I do think 
it'd make sense to increase the length of release cycle so we can make better 
designs.
My strawman proposal is to maintain a regular release cadence, as we did in 
Spark 1.x, and increase the cycle from 3 months to 4 months. This effectively 
gives us ~50% more time to develop (in reality it'd be slightly less than 50% 
since longer dev time also means longer QA time). As for maintenance releases, 
I think those should still be cut on-demand, similar to Spark 1.x, but more 
aggressively.
To put this into perspective, 4-month cycle means we will release Spark 2.1.0 
at the end of Nov or early Dec (and branch cut / code freeze at the end of Oct).
I am curious what others think.



   

Broadcast big dataset

2016-09-28 Thread WangJianfei
Hi Devs
 In my application, I just broadcast a dataset (about 500 MB) to the
executors (100+), and I got a Java heap error:
Jmartad-7219.hadoop.jd.local:53591 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:48 INFO BlockManagerInfo: Added broadcast_9_piece19 in memory
on BJHC-Jmartad-9012.hadoop.jd.local:53197 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:49 INFO BlockManagerInfo: Added broadcast_9_piece8 in memory
on BJHC-Jmartad-84101.hadoop.jd.local:52044 (size: 4.0 MB, free: 3.3 GB)
16/09/28 15:56:58 INFO BlockManagerInfo: Removed broadcast_8_piece0 on
172.22.176.114:37438 in memory (size: 2.7 KB, free: 3.1 GB)
16/09/28 15:56:58 WARN TaskSetManager: Lost task 125.0 in stage 7.0 (TID
130, BJHC-Jmartad-9376.hadoop.jd.local): java.lang.OutOfMemoryError: Java
heap space
at 
java.io.ObjectInputStream$HandleTable.grow(ObjectInputStream.java:3465)
at
java.io.ObjectInputStream$HandleTable.assign(ObjectInputStream.java:3271)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1789)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)

My configuration is 4G of memory for the driver. Any advice is appreciated.
Thank you!



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Broadcast-big-dataset-tp19127.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Spark Executor Lost issue

2016-09-28 Thread Aditya

Hi All,

Any updates on this?

On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote:
Try increasing the parallelism by repartitioning; you may also increase
spark.default.parallelism.

You can also try decreasing the number of executor cores.
Basically, this happens when the executor uses more memory than it asked
for, and YARN kills the executor.


Regards,

Sushrut Ikhar
https://about.me/sushrutikhar




On Wed, Sep 28, 2016 at 12:17 PM, Aditya wrote:


I have a Spark job which runs fine for small data. But when the data
increases it gives an executor-lost error. My executor and driver
memory are set at their highest points. I have also tried increasing
--conf spark.yarn.executor.memoryOverhead=600 but am still not able
to fix the problem. Is there any other solution to fix the problem?








Re: IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Marcin Tustin
I've solved this in the past by using a thread pool which runs clean up
code on thread creation, to clear out stale values.
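
A sketch of that approach, assuming the stale value is the
spark.sql.execution.id local property on the SparkContext; the pool wiring
below is illustrative rather than the exact code referred to above:

import java.util.concurrent.{ForkJoinPool, ForkJoinWorkerThread}
import java.util.concurrent.ForkJoinPool.ForkJoinWorkerThreadFactory
import org.apache.spark.SparkContext

object CleaningPool {
  // A ForkJoinPool whose workers clear any inherited execution id on start-up.
  def apply(sc: SparkContext): ForkJoinPool = {
    val factory = new ForkJoinWorkerThreadFactory {
      override def newThread(pool: ForkJoinPool): ForkJoinWorkerThread =
        new ForkJoinWorkerThread(pool) {
          override def onStart(): Unit = {
            super.onStart()
            // Drop the execution id copied from the parent thread's properties.
            sc.setLocalProperty("spark.sql.execution.id", null)
          }
        }
    }
    new ForkJoinPool(Runtime.getRuntime.availableProcessors(), factory,
      null /* uncaught exception handler */, false /* asyncMode */)
  }
}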

On Wednesday, September 28, 2016, Grant Digby  wrote:

> Hi,
>
> We've received the following error a handful of times and once it's
> occurred
> all subsequent queries fail with the same exception until we bounce the
> instance:
>
> IllegalArgumentException: spark.sql.execution.id is already set
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(
> SQLExecution.scala:77)
> at
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)
>
> ForkJoinWorkerThreads call into SQLExecution#withNewExecutionId, are
> assigned an execution id in their InheritableThreadLocal, and this is later
> cleared in the finally block.
> I've noted that these ForkJoinWorkerThreads can create additional
> ForkJoinWorkerThreads and (as of SPARK-10563) the child threads receive a
> copy of the parent's properties.
> It seems that prior to SPARK-10563, clearing the parent's executionId would
> have cleared the child's, but now that the child gets a copy of the
> properties, the child's executionId is never cleared, leading to the above
> exception.
> I'm yet to recreate the issue locally; whilst I've seen
> ForkJoinWorkerThreads creating others and the properties being copied
> across, I've not seen this happen from within the body of withNewExecutionId.
>
> Does this all sound reasonable?
> Our plan for a short-term workaround is to allow the condition to arise but
> remove the execution.id from the thread local before throwing the
> IllegalArgumentException so it succeeds on retry.
>
>
>
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/IllegalArgumentException-spark-sql-execution-id-is-already-set-tp19124.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>
>




Re: java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Sean Owen
My guess is that Kryo specially handles Maps generically or relies on
some mechanism that does, and it happens to iterate over all
key/values as part of that and of course there aren't actually any
key/values in the map. The Java serialization is a much more literal
(expensive) field-by-field serialization which works here because
there's no special treatment. I think you could register a custom
serializer that handles this case. Or work around it in your client
code. I know there have been other issues with Kryo and Map because,
for example, sometimes a Map in an application is actually some
non-serializable wrapper view.

On Wed, Sep 28, 2016 at 3:18 AM, Maciej Szymkiewicz
 wrote:
> Hi everyone,
>
> I suspect there is no point in submitting a JIRA to fix this (not a Spark
> issue?) but I would like to know if this problem is documented anywhere.
> Somehow Kryo is losing the default value during serialization:
>
> scala> import org.apache.spark.{SparkContext, SparkConf}
> import org.apache.spark.{SparkContext, SparkConf}
>
> scala> val aMap = Map[String, Long]().withDefaultValue(0L)
> aMap: scala.collection.immutable.Map[String,Long] = Map()
>
> scala> aMap("a")
> res6: Long = 0
>
> scala> val sc = new SparkContext(new
> SparkConf().setAppName("bar").set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer"))
>
> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
> 16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
> java.util.NoSuchElementException: key not found: a
>
> while Java serializer works just fine:
>
> scala> val sc = new SparkContext(new
> SparkConf().setAppName("bar").set("spark.serializer",
> "org.apache.spark.serializer.JavaSerializer"))
>
> scala> sc.parallelize(Seq(aMap)).map(_("a")).first
> res9: Long = 0
>
> --
> Best regards,
> Maciej

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.0.1 (RC3)

2016-09-28 Thread Sean Owen
(Process-wise there's no problem with that. The vote is open for at
least 3 days and ends when the RM says it ends. So it's valid anyway
as the vote is still open.)

On Tue, Sep 27, 2016 at 8:37 PM, Reynold Xin  wrote:
> So technically the vote has passed, but IMHO it does not make sense to
> release this and then immediately release 2.0.2. I will work on a new RC
> once SPARK-17666 and SPARK-17673 are fixed.
>
> Please shout if you disagree.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



IllegalArgumentException: spark.sql.execution.id is already set

2016-09-28 Thread Grant Digby
Hi,

We've received the following error a handful of times and once it's occurred
all subsequent queries fail with the same exception until we bounce the
instance:

IllegalArgumentException: spark.sql.execution.id is already set
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at
org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:2086)

ForkJoinWorkerThreads call into SQLExecution#withNewExecutionId, are assigned
an execution id in their InheritableThreadLocal, and this is later cleared in
the finally block.
I've noted that these ForkJoinWorkerThreads can create additional
ForkJoinWorkerThreads and (as of SPARK-10563) the child threads receive a
copy of the parent's properties.
It seems that prior to SPARK-10563, clearing the parent's executionId would
have cleared the child's, but now that the child gets a copy of the properties,
the child's executionId is never cleared, leading to the above exception.
I'm yet to recreate the issue locally; whilst I've seen ForkJoinWorkerThreads
creating others and the properties being copied across, I've not seen this
happen from within the body of withNewExecutionId.

Does this all sound reasonable? 
Our plan for a short-term workaround is to allow the condition to arise but
remove the execution.id from the thread local before throwing the
IllegalArgumentException so it succeeds on retry.




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/IllegalArgumentException-spark-sql-execution-id-is-already-set-tp19124.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



java.util.NoSuchElementException when serializing Map with default value

2016-09-28 Thread Maciej Szymkiewicz
Hi everyone,

I suspect there is no point in submitting a JIRA to fix this (not a
Spark issue?) but I would like to know if this problem is documented
anywhere. Somehow Kryo is losing the default value during serialization:

scala> import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.{SparkContext, SparkConf}

scala> val aMap = Map[String, Long]().withDefaultValue(0L)
aMap: scala.collection.immutable.Map[String,Long] = Map()

scala> aMap("a")
res6: Long = 0

scala> val sc = new SparkContext(new
SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.KryoSerializer"))

scala> sc.parallelize(Seq(aMap)).map(_("a")).first
16/09/28 09:13:47 ERROR Executor: Exception in task 2.0 in stage 2.0
(TID 7)
java.util.NoSuchElementException: key not found: a

while Java serializer works just fine:

scala> val sc = new SparkContext(new
SparkConf().setAppName("bar").set("spark.serializer",
"org.apache.spark.serializer.JavaSerializer"))

scala> sc.parallelize(Seq(aMap)).map(_("a")).first
res9: Long = 0

-- 
Best regards,
Maciej



Re: Spark Executor Lost issue

2016-09-28 Thread Aditya

Thanks Sushrut for the reply.

Currently I have not defined spark.default.parallelism property.
Can you let me know how much should I set it to?


Regards,
Aditya Calangutkar

On Wednesday 28 September 2016 12:22 PM, Sushrut Ikhar wrote:
Try increasing the parallelism by repartitioning; you may also increase
spark.default.parallelism.

You can also try decreasing the number of executor cores.
Basically, this happens when the executor uses more memory than it asked
for, and YARN kills the executor.


Regards,

Sushrut Ikhar
https://about.me/sushrutikhar




On Wed, Sep 28, 2016 at 12:17 PM, Aditya wrote:


I have a Spark job which runs fine for small data. But when the data
increases it gives an executor-lost error. My executor and driver
memory are set at their highest points. I have also tried increasing
--conf spark.yarn.executor.memoryOverhead=600 but am still not able
to fix the problem. Is there any other solution to fix the problem?








Spark Executor Lost issue

2016-09-28 Thread Aditya
I have a Spark job which runs fine for small data. But when the data
increases it gives an executor-lost error. My executor and driver memory are
set at their highest points. I have also tried increasing --conf
spark.yarn.executor.memoryOverhead=600 but am still not able to fix the
problem. Is there any other solution to fix the problem?