Why does ShuffleMapTask have transient locs and preferredLocs?!

2017-01-03 Thread Jacek Laskowski
Hi, I just found out that ShuffleMapTask has transient locs and preferredLocs attributes, which means that when a ShuffleMapTask is serialized (as a broadcast variable) the information is gone. Does this mean the attributes could have been left undefined entirely, since Spark uses SortShuffleManager
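
For illustration, here is a minimal, self-contained sketch (not Spark's actual ShuffleMapTask, whose constructor and fields differ) of why a @transient field disappears across serialization:

    import java.io._

    // Hypothetical stand-in for a task carrying a transient preferred-locations field.
    class DemoTask(@transient val preferredLocs: Seq[String]) extends Serializable

    object TransientDemo extends App {
      val task = new DemoTask(Seq("host1", "host2"))

      // Round-trip through Java serialization, as happens when a task is shipped.
      val bytes = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bytes)
      oos.writeObject(task)
      oos.close()
      val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
        .readObject().asInstanceOf[DemoTask]

      println(task.preferredLocs) // List(host1, host2)
      println(copy.preferredLocs) // null -- transient fields are skipped
    }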

Re: Tests failing with GC limit exceeded

2017-01-03 Thread shane knapp
Nope, no changes to Jenkins in the past few months. Ganglia graphs show higher, but not worrying, memory usage on the workers when the jobs failed... I'll take a closer look later tonight/first thing tomorrow morning. Shane On Tue, Jan 3, 2017 at 4:35 PM, Kay Ousterhout

Re: What is mainly different from a UDT and a spark internal type that ExpressionEncoder recognized?

2017-01-03 Thread Liang-Chi Hsieh
Actually, I think a UDT can directly translate an object into Spark's internal format via ScalaReflection and an encoder, without the intermediate generic row. You can directly create a Dataset of objects of the UDT's user class. If you don't convert the Dataset to a DataFrame, I think RowEncoder won't step in.
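
As context, a rough sketch of the UDT machinery being discussed. Point and PointUDT are made-up examples, and UserDefinedType is a developer-level, version-dependent API (not public in some 2.x releases), so treat this purely as a sketch of the mechanism:

    import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
    import org.apache.spark.sql.types._

    // Hypothetical user class tied to its UDT via the @SQLUserDefinedType annotation.
    @SQLUserDefinedType(udt = classOf[PointUDT])
    case class Point(x: Double, y: Double)

    // The UDT converts Point to and from Spark's internal format directly,
    // with no generic Row in between.
    class PointUDT extends UserDefinedType[Point] {
      override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
      override def serialize(p: Point): Any = new GenericArrayData(Array[Any](p.x, p.y))
      override def deserialize(datum: Any): Point = datum match {
        case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
      }
      override def userClass: Class[Point] = classOf[Point]
    }

With the annotation in place, a ScalaReflection-derived encoder should pick up the UDT, so spark.createDataset(Seq(Point(1, 2))) would stay a Dataset[Point] with no RowEncoder involved, at least on the code paths described above.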

Re: Why does ShuffleMapTask have transient locs and preferredLocs?!

2017-01-03 Thread Imran Rashid
Hi Jacek, I'm not entirely sure I understand your question, but the reason preferredLocs can be transient is that it is only used to tell the scheduler (on the driver) where it should prefer to assign the task. But no matter the value, the task could still get assigned anywhere. By the time that

Re: What is the main difference between a UDT and a Spark internal type that ExpressionEncoder recognizes?

2017-01-03 Thread Jacek Laskowski
Hi Shuai, Disclaimer: I'm not a Spark guru, and what's written below are some notes I took when reading the Spark source code, so I could be wrong, in which case I'd appreciate it a lot if someone could correct me. (Yes, I did copy your disclaimer since it applies to me too. Sorry for the duplication :))

Re: What is the main difference between a UDT and a Spark internal type that ExpressionEncoder recognizes?

2017-01-03 Thread Jacek Laskowski
Thanks, Herman, for the explanation. I'll quietly assume that the other points were OK since you did not object. Correct? Regards, Jacek Laskowski https://medium.com/@jaceklaskowski/ Mastering Apache Spark 2.0 https://bit.ly/mastering-apache-spark Follow me at

Re: What is the main difference between a UDT and a Spark internal type that ExpressionEncoder recognizes?

2017-01-03 Thread Herman van Hövell tot Westerflier
@Jacek The maximum output of 200 fields for whole-stage code generation was chosen to prevent the generated method from exceeding the 64KB bytecode limit. There is absolutely no relation between this value and the number of partitions after a shuffle (if there were, they should have used the
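
For anyone who wants to experiment with that cutoff: if I am reading the Spark 2.x source correctly, it is governed by an internal setting, so the following is a sketch rather than documented API:

    // Assumed internal knob (Spark 2.x): max number of output fields for which
    // whole-stage codegen is still attempted.
    spark.conf.set("spark.sql.codegen.maxFields", "200")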

Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread khyati
Hi Reynold Xin, I tried setting spark.sql.files.ignoreCorruptFiles = true using the commands val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) sqlContext.setConf("spark.sql.files.ignoreCorruptFiles","true") / sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true") but still

Re: Spark Improvement Proposals

2017-01-03 Thread Imran Rashid
I'm also in favor of this. Thanks for your persistence, Cody. My take on the specific issues Joseph mentioned: 1) voting vs. consensus -- I agree with the argument Ryan Blue made earlier for consensus: > Majority vs consensus: My rationale is that I don't think we want to consider a proposal

Re: StateStoreSaveExec / StateStoreRestoreExec

2017-01-03 Thread Michael Armbrust
You might also be interested in this: https://issues.apache.org/jira/browse/SPARK-19031 On Tue, Jan 3, 2017 at 3:36 PM, Michael Armbrust wrote: > I think we should add something similar to mapWithState in 2.2. It would > be great if you could add the description of your

Re: Spark Improvement Proposals

2017-01-03 Thread Cody Koeninger
I don't have a concern about voting vs. consensus. I have a concern that whatever the decision-making process is, it is explicitly announced on the ticket for the given proposal, with an explicit deadline and an explicit outcome. On Tue, Jan 3, 2017 at 4:08 PM, Imran Rashid

Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread khyati.shah
Yes! Using Spark 2.1. I hope I am using the right syntax for setting the conf: sqlContext.setConf("spark.sql.files.ignoreCorruptFiles","true") / sqlContext.sql("set spark.sql.files.ignoreCorruptFiles=true") Original message From: Ryan Blue

Re: Spark Improvement Proposals

2017-01-03 Thread Joseph Bradley
Hi Cody, Thanks for being persistent about this. I too would like to see this happen. Reviewing the thread, it sounds like the main things remaining are: * Decide about a few issues * Finalize the doc(s) * Vote on this proposal Issues & TODOs: (1) The main issue I see above is voting vs.

Re: ml word2vec findSynonyms return type

2017-01-03 Thread Asher Krim
The JIRA: https://issues.apache.org/jira/browse/SPARK-17629 Adding new methods could result in method clutter. Changing the behavior of non-experimental classes is unfortunate (ml Word2Vec was marked Experimental until Spark 2.0). Neither option is great. If I had to pick, I would rather change the
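
For reference, a sketch of the ml API under discussion as of Spark 2.x; the tiny training set below is made up purely for illustration:

    import org.apache.spark.ml.feature.Word2Vec

    // Assumes an existing SparkSession named spark.
    val docs = spark.createDataFrame(Seq(
      Tuple1("spark is a fast engine".split(" ").toSeq),
      Tuple1("spark runs on a cluster".split(" ").toSeq)
    )).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setVectorSize(3)
      .setMinCount(1)
      .fit(docs)

    // ml returns a DataFrame of (word, similarity) rows, while mllib's
    // equivalent returns Array[(String, Double)]; that mismatch is what
    // SPARK-17629 is about.
    model.findSynonyms("spark", 2).show()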

Re: Skip Corrupted Parquet blocks / footer.

2017-01-03 Thread Ryan Blue
Khyati, Are you using Spark 2.1? The usual entry point for Spark 2.x is spark rather than sqlContext. rb On Tue, Jan 3, 2017 at 11:03 AM, khyati wrote: > Hi Reynold Xin, > > I tried setting spark.sql.files.ignoreCorruptFiles = true by using > commands, > > val
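
In other words (my reading of rb's suggestion, shown as a sketch):

    // Spark 2.x: configure through the SparkSession instead of a HiveContext.
    spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
    // or, equivalently, via SQL:
    spark.sql("SET spark.sql.files.ignoreCorruptFiles=true")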

Re: Apache Hive with Spark Configuration

2017-01-03 Thread Ryan Blue
Chetan, Spark is currently using Hive 1.2.1 to interact with the Metastore. Using that version for Hive is going to be the most reliable, but the metastore API doesn't change very often and we've found (from having different versions as well) that older versions are mostly compatible. Some things
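
For completeness, the knobs this refers to (as I understand the Spark 2.x docs) for pointing Spark at a particular metastore client version; the app name and values below are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hive-metastore-demo")
      // Version of the Hive metastore client Spark should use.
      .config("spark.sql.hive.metastore.version", "1.2.1")
      // Where to find the client jars: "builtin", "maven", or a classpath.
      .config("spark.sql.hive.metastore.jars", "builtin")
      .enableHiveSupport()
      .getOrCreate()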