Re: spark1.6.2 ClassNotFoundException: org.apache.parquet.hadoop.ParquetOutputCommitter

2016-07-07 Thread Sun Rui
Maybe related to the "parquet-provided" profile?
Remove the "parquet-provided" profile when making the distribution, or add the
parquet jar to the classpath when running Spark.
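
For example (a sketch; the build flags are illustrative and the parquet version
should match your build), either make the distribution without the profile:

./make-distribution.sh --name custom-spark --tgz -Phadoop-2.6 -Phive

or keep the profile and supply the parquet jars at runtime:

./bin/spark-submit --jars /path/to/parquet-hadoop-1.7.0.jar ...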
> On Jul 8, 2016, at 09:25, kevin  wrote:
> 
> parquet-provided



Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
Hi, I am sorry but it's still not clear. Do you mean ./bin/spark-shell --master
local? And what do I do after that? Killing the org.apache.spark.deploy.SparkSubmit
--master local --class org.apache.spark.repl.Main --name Spark shell spark-shell
process will kill the shell, so I couldn't send the commands. Thanks

On Friday, July 8, 2016 12:05 AM, Jacek Laskowski  wrote:
 

 Hi,

Then use --master with Spark Standalone, YARN, or Mesos.
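
For example (host and port illustrative), against a standalone master:

./bin/spark-shell --master spark://your-master:7077

The executors then run as separate CoarseGrainedExecutorBackend processes on
the workers, so one can be killed without killing the shell.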

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 7, 2016 at 10:35 PM, Mr rty ff  wrote:
> I don't think it's the proper way to recreate the bug, because I should
> continue to send commands to the shell.
> They are talking about killing the CoarseGrainedExecutorBackend.
>
>
> On Thursday, July 7, 2016 11:32 PM, Jacek Laskowski  wrote:
>
>
> Hi,
>
> It appears you're running local mode (local[*] assumed) so killing
> spark-shell *will* kill the one and only executor -- the driver :)
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Thu, Jul 7, 2016 at 10:27 PM, Mr rty ff  wrote:
>> This is what I get when I run the command:
>> 946 sun.tools.jps.Jps -lm
>> 7443 org.apache.spark.deploy.SparkSubmit --class
>> org.apache.spark.repl.Main
>> --name Spark shell spark-shell
>> I don't think that should kill the SparkSubmit process
>>
>>
>>
>> On Thursday, July 7, 2016 9:58 PM, Jacek Laskowski 
>> wrote:
>>
>>
>> Hi,
>>
>> Use jps -lm and see the processes on the machine(s) to kill.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> 
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff 
>> wrote:
>>> Hi
>>> I'd like to recreate this bug:
>>> https://issues.apache.org/jira/browse/SPARK-13979
>>> They are talking about stopping Spark executors. It's not clear exactly how
>>> I stop the executors.
>>> Thanks
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>>
>>
>>
>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  

Re: Bad JIRA components

2016-07-07 Thread Nicholas Chammas
Thanks Reynold.

On Thu, Jul 7, 2016 at 5:03 PM Reynold Xin  wrote:

> I deleted those.
>
>
> On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel
>>
>> There are several bad components in there, like docs, MLilb, and sq;.
>> I’ve updated the issues that were assigned to them, but I don’t know if
>> there is a way to delete these components from the drop down so they don’t
>> get mistakenly selected again.
>>
>> Can a JIRA admin take a look?
>>
>> Nick
>>
>
>


Re: Bad JIRA components

2016-07-07 Thread Reynold Xin
I deleted those.


On Thu, Jul 7, 2016 at 1:27 PM, Nicholas Chammas  wrote:

>
> https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel
>
> There are several bad components in there, like docs, MLilb, and sq;.
> I’ve updated the issues that were assigned to them, but I don’t know if
> there is a way to delete these components from the drop down so they don’t
> get mistakenly selected again.
>
> Can a JIRA admin take a look?
>
> Nick
>


Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Tom Graves
I think the problem comes in with your definition, as well as people's
interpretation of it. I don't agree with your test of "where the 'how' is
different from the 'what'".

This could apply to a lot of things. I could easily file a jira that says
"remove synchronization on routine x", then change a lock. No discussion is
needed and the how is the same as the what, but it could have a huge impact on
the code and definitely should have a jira. It may be a contrived example, but
there is a lot of leeway in that, which is why I think Patrick originally sent
this email -- and you said it yourself above that some of the things reverted
weren't trivial enough to begin with. That just proves people can't make that
judgement call themselves. So why not just file a jira for everything?

Another example of this could be doc changes. You may think they are trivial,
but if someone changes the docs and removes configs or changes the wording
such that users don't understand, then to me it should have had a jira and
possibly discussion before changing.

So based on that, it seems like spending the 5 to 30 seconds to file a jira
would only help in tracking things and isn't much overhead. We also base our
release notes and other things on jira.

Also, for hotfixes, I think they should have the original jira or a separate
jira (linked as "broken by" to the original), again for tracking purposes. If
we check something into master and then later want to cherry-pick it back, I
might just pick the original commit and totally miss this "HOTFIX" that was
required if they aren't properly linked.

Tom

On Thursday, July 7, 2016 2:56 PM, Sean Owen  wrote:
 

 I don't agree that every change needs a JIRA, myself. Really, we
didn't choose to have this system split across JIRA and Github PRs.
It's necessitated by how the ASF works (and with some good reasons).
But while we have this dual system, I figure, let's try to make some
sense of it.

I think it makes sense to make a JIRA for any non-trivial change.
What's non-trivial? where the "how" is different from the "what". That
is, if the JIRA is not just a repeat of the pull request, they should
probably be separate. But, if the change is so simple that describing
it amounts to dictating how it's implemented -- well, seems like a
JIRA is just overhead.

One problem that I think happened above was: pretty non-trivial things
were being merged without a JIRA. The evidence? they were reverted.
That means their effect was not quite obvious. They probably deserved
more discussion. Anything that needs some discussion probably deserves
a JIRA.

Also: we have some hot-fixes here that aren't connected to JIRAs.
Either they belong with an existing JIRA and aren't tagged correctly,
or, again, are patching changes that weren't really trivial enough to
skip a JIRA to begin with.

On Thu, Jul 7, 2016 at 7:47 PM, Tom Graves  wrote:
> Popping this back up to the dev list again.  I see a bunch of checkins with
> minor or hotfix.
>
> It seems to me we shouldn't be doing this, but I would like to hear thoughts
> from others.  I see no reason we can't have a jira for each of those issues,
> it only takes a few seconds to file one and it makes things much easier to
> track.
>
> For instance, I tend to watch the jiras on the mailing list and if I hit an
> issue I search jira to see if there is an existing one for it, but if there
> isn't a jira then I can't find what someone perhaps already
> fixed with a [MINOR] checkin.
>
> Tom
>
>
> On Saturday, June 6, 2015 11:02 AM, Patrick Wendell 
> wrote:
>
>
> Hey All,
>
> Just a request here - it would be great if people could create JIRA's
> for any and all merged pull requests. The reason is that when patches
> get reverted due to build breaks or other issues, it is very difficult
> to keep track of what is going on if there is no JIRA. Here is a list
> of 5 patches we had to revert recently that didn't include a JIRA:
>
>    Revert "[MINOR] [BUILD] Use custom temp directory during build."
>    Revert "[SQL] [TEST] [MINOR] Uses a temporary log4j.properties in
> HiveThriftServer2Test to ensure expected logging behavior"
>    Revert "[BUILD] Always run SQL tests in master build."
>    Revert "[MINOR] [CORE] Warn users who try to cache RDDs with
> dynamic allocation on."
>    Revert "[HOT FIX] [YARN] Check whether `/lib` exists before
> listing its files"
>
> The cost overhead of creating a JIRA relative to other aspects of
> development is very small. If it's *really* a documentation change or
> something small, that's okay.
>
> But anything affecting the build, packaging, etc. These all need to
> have a JIRA to ensure that follow-up can be well communicated to all
> Spark developers.
>
> Hopefully this is something everyone can get behind, but opened a
> discussion here in case others feel differently.
>
> - Patrick
>
> 

Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
I don't think it's the proper way to recreate the bug, because I should continue
to send commands to the shell. They are talking about killing the
CoarseGrainedExecutorBackend.

On Thursday, July 7, 2016 11:32 PM, Jacek Laskowski  wrote:
 

 Hi,

It appears you're running local mode (local[*] assumed) so killing
spark-shell *will* kill the one and only executor -- the driver :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 7, 2016 at 10:27 PM, Mr rty ff  wrote:
> This is what I get when I run the command:
> 946 sun.tools.jps.Jps -lm
> 7443 org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
> --name Spark shell spark-shell
> I don't think that should kill the SparkSubmit process
>
>
>
> On Thursday, July 7, 2016 9:58 PM, Jacek Laskowski  wrote:
>
>
> Hi,
>
> Use jps -lm and see the processes on the machine(s) to kill.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff  wrote:
>> Hi
>> I'd like to recreate this bug:
>> https://issues.apache.org/jira/browse/SPARK-13979
>> They are talking about stopping Spark executors. It's not clear exactly how
>> I stop the executors.
>> Thanks
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  

Re: Stopping Spark executors

2016-07-07 Thread Jacek Laskowski
Hi,

It appears you're running local mode (local[*] assumed) so killing
spark-shell *will* kill the one and only executor -- the driver :)

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Thu, Jul 7, 2016 at 10:27 PM, Mr rty ff  wrote:
> This is what I get when I run the command:
> 946 sun.tools.jps.Jps -lm
> 7443 org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main
> --name Spark shell spark-shell
> I don't think that should kill the SparkSubmit process
>
>
>
> On Thursday, July 7, 2016 9:58 PM, Jacek Laskowski  wrote:
>
>
> Hi,
>
> Use jps -lm and see the processes on the machine(s) to kill.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff  wrote:
>> Hi
>> I'd like to recreate this bug:
>> https://issues.apache.org/jira/browse/SPARK-13979
>> They are talking about stopping Spark executors. It's not clear exactly how
>> I stop the executors.
>> Thanks
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Stopping Spark executors

2016-07-07 Thread Mr rty ff
This is what I get when I run the command:
946 sun.tools.jps.Jps -lm
7443 org.apache.spark.deploy.SparkSubmit --class org.apache.spark.repl.Main --name
Spark shell spark-shell
I don't think that should kill the SparkSubmit process.
 

On Thursday, July 7, 2016 9:58 PM, Jacek Laskowski  wrote:
 

 Hi,

Use jps -lm and see the processes on the machine(s) to kill.

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff  wrote:
> Hi
> I'd like to recreate this bug:
> https://issues.apache.org/jira/browse/SPARK-13979
> They are talking about stopping Spark executors. It's not clear exactly how I
> stop the executors.
> Thanks

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



  

Bad JIRA components

2016-07-07 Thread Nicholas Chammas
https://issues.apache.org/jira/browse/SPARK/?selectedTab=com.atlassian.jira.jira-projects-plugin:components-panel

There are several bad components in there, like docs, MLilb, and sq;. I’ve
updated the issues that were assigned to them, but I don’t know if there is
a way to delete these components from the drop down so they don’t get
mistakenly selected again.

Can a JIRA admin take a look?

Nick


Re: Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
JIRA is here: https://issues.apache.org/jira/browse/SPARK-16427

On Thu, Jul 7, 2016 at 3:18 PM Reynold Xin  wrote:

> Please create a patch. Thanks!
>
>
> On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’m looking at the docs here:
>>
>>
>> http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel
>> 
>>
>> A newcomer to Spark won’t understand the meaning of _2, or the meaning
>> of _SER (or its value), and won’t understand how exactly memory and disk
>> play together when something like MEMORY_AND_DISK is selected.
>>
>> Is there a place in the docs that expands on the storage levels a bit? If
>> not, shall we create a JIRA and expand this documentation? I don’t mind
>> taking on this task, though frankly I’m interested in this because I don’t
>> fully understand the differences myself. :)
>>
>> Nick
>>
>
>


Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Sean Owen
I don't agree that every change needs a JIRA, myself. Really, we
didn't choose to have this system split across JIRA and Github PRs.
It's necessitated by how the ASF works (and with some good reasons).
But while we have this dual system, I figure, let's try to make some
sense of it.

I think it makes sense to make a JIRA for any non-trivial change.
What's non-trivial? where the "how" is different from the "what". That
is, if the JIRA is not just a repeat of the pull request, they should
probably be separate. But, if the change is so simple that describing
it amounts to dictating how it's implemented -- well, seems like a
JIRA is just overhead.

One problem that I think happened above was: pretty non-trivial things
were being merged without a JIRA. The evidence? they were reverted.
That means their effect was not quite obvious. They probably deserved
more discussion. Anything that needs some discussion probably deserves
a JIRA.

Also: we have some hot-fixes here that aren't connected to JIRAs.
Either they belong with an existing JIRA and aren't tagged correctly,
or, again, are patching changes that weren't really trivial enough to
skip a JIRA to begin with.

On Thu, Jul 7, 2016 at 7:47 PM, Tom Graves  wrote:
> Popping this back up to the dev list again.  I see a bunch of checkins with
> minor or hotfix.
>
> It seems to me we shouldn't be doing this, but I would like to hear thoughts
> from others.  I see no reason we can't have a jira for each of those issues,
> it only takes a few seconds to file one and it makes things much easier to
> track.
>
> For instance, I tend to watch the jiras on the mailing list and if I hit an
> issue I search jira to see if there is an existing one for it, but if there
> isn't a jira then I can't find what someone perhaps already
> fixed with a [MINOR] checkin.
>
> Tom
>
>
> On Saturday, June 6, 2015 11:02 AM, Patrick Wendell 
> wrote:
>
>
> Hey All,
>
> Just a request here - it would be great if people could create JIRA's
> for any and all merged pull requests. The reason is that when patches
> get reverted due to build breaks or other issues, it is very difficult
> to keep track of what is going on if there is no JIRA. Here is a list
> of 5 patches we had to revert recently that didn't include a JIRA:
>
> Revert "[MINOR] [BUILD] Use custom temp directory during build."
> Revert "[SQL] [TEST] [MINOR] Uses a temporary log4j.properties in
> HiveThriftServer2Test to ensure expected logging behavior"
> Revert "[BUILD] Always run SQL tests in master build."
> Revert "[MINOR] [CORE] Warn users who try to cache RDDs with
> dynamic allocation on."
> Revert "[HOT FIX] [YARN] Check whether `/lib` exists before
> listing its files"
>
> The cost overhead of creating a JIRA relative to other aspects of
> development is very small. If it's *really* a documentation change or
> something small, that's okay.
>
> But anything affecting the build, packaging, etc. These all need to
> have a JIRA to ensure that follow-up can be well communicated to all
> Spark developers.
>
> Hopefully this is something everyone can get behind, but opened a
> discussion here in case others feel differently.
>
> - Patrick
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Michael Allman
FYI, if you just want to look at the source code, there are source jars for
those binary versions in Maven Central. I was just looking at the metastore
source code last night.
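
For example (a sketch; the coordinates are my reading of Spark's pom, so
double-check them), something like this should fetch the hive-exec sources:

mvn dependency:get -Dartifact=org.spark-project.hive:hive-exec:1.2.1.spark2:jar:sources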

Michael

> On Jul 7, 2016, at 12:13 PM, Jonathan Kelly  wrote:
> 
> I'm not sure, but I think it's
> https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2.
> 
> It would be really nice though to have this whole process better documented 
> and more "official" than just building from somebody's personal fork of Hive.
> 
> Or is there some way that the Spark community could contribute back these 
> changes to Hive in such a way that they would accept them into trunk? Then 
> Spark could depend upon an official version of Hive rather than this fork.
> 
> ~ Jonathan
> 
> On Thu, Jul 7, 2016 at 11:46 AM Marcelo Vanzin  wrote:
> (Actually that's "spark" and not "spark2", so yeah, that doesn't
> really answer the question.)
> 
> On Thu, Jul 7, 2016 at 11:38 AM, Marcelo Vanzin  wrote:
> > My guess would be https://github.com/pwendell/hive/tree/release-1.2.1-spark
> >
> > On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang  wrote:
> >> I saw that the pom file has the hive version as
> >> 1.2.1.spark2, but I cannot find the branch in
> >> https://github.com/pwendell/
> >>
> >> Does anyone know where the repo is?
> >>
> >> Thanks.
> >>
> >> Zhan Zhang
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> >> Nabble.com.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> > --
> > Marcelo
> 
> --
> Marcelo
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Expanded docs for the various storage levels

2016-07-07 Thread Reynold Xin
Please create a patch. Thanks!


On Thu, Jul 7, 2016 at 12:07 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I’m looking at the docs here:
>
>
> http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel
> 
>
> A newcomer to Spark won’t understand the meaning of _2, or the meaning of
> _SER (or its value), and won’t understand how exactly memory and disk
> play together when something like MEMORY_AND_DISK is selected.
>
> Is there a place in the docs that expands on the storage levels a bit? If
> not, shall we create a JIRA and expand this documentation? I don’t mind
> taking on this task, though frankly I’m interested in this because I don’t
> fully understand the differences myself. :)
>
> Nick
>


Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Jonathan Kelly
I'm not sure, but I think it's
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2.

It would be really nice though to have this whole process better documented
and more "official" than just building from somebody's personal fork of
Hive.

Or is there some way that the Spark community could contribute back these
changes to Hive in such a way that they would accept them into trunk? Then
Spark could depend upon an official version of Hive rather than this fork.

~ Jonathan

On Thu, Jul 7, 2016 at 11:46 AM Marcelo Vanzin  wrote:

> (Actually that's "spark" and not "spark2", so yeah, that doesn't
> really answer the question.)
>
> On Thu, Jul 7, 2016 at 11:38 AM, Marcelo Vanzin 
> wrote:
> > My guess would be
> https://github.com/pwendell/hive/tree/release-1.2.1-spark
> >
> > On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang  wrote:
> >> I saw that the pom file has the hive version as
> >> 1.2.1.spark2, but I cannot find the branch in
> >> https://github.com/pwendell/
> >>
> >> Does anyone know where the repo is?
> >>
> >> Thanks.
> >>
> >> Zhan Zhang
> >>
> >>
> >>
> >>
> >> --
> >> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
> >> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >>
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>
> >
> >
> >
> > --
> > Marcelo
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Expanded docs for the various storage levels

2016-07-07 Thread Nicholas Chammas
I’m looking at the docs here:

http://spark.apache.org/docs/1.6.2/api/python/pyspark.html#pyspark.StorageLevel


A newcomer to Spark won’t understand the meaning of _2, or the meaning of
_SER (or its value), and won’t understand how exactly memory and disk play
together when something like MEMORY_AND_DISK is selected.

Is there a place in the docs that expands on the storage levels a bit? If
not, shall we create a JIRA and expand this documentation? I don’t mind
taking on this task, though frankly I’m interested in this because I don’t
fully understand the differences myself. :)
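
In the meantime, here is my current reading of what the suffixes encode, as a
small Scala sketch (the flag names come from the StorageLevel constructor; an
illustration, not authoritative docs):

import org.apache.spark.storage.StorageLevel

// _SER: store blocks as serialized bytes (deserialized = false) -- smaller,
// but slower to access. _2: keep two replicas across the cluster.
// MEMORY_AND_DISK: partitions that don't fit in memory spill to disk
// instead of being recomputed on the fly.
val level = StorageLevel(
  useDisk = true,       // spill to disk when memory is full
  useMemory = true,     // try memory first
  useOffHeap = false,   // regular on-heap storage
  deserialized = false, // the _SER part
  replication = 2)      // the _2 part

// this combination should be exactly MEMORY_AND_DISK_SER_2
println(level == StorageLevel.MEMORY_AND_DISK_SER_2)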

Nick


Re: Stopping Spark executors

2016-07-07 Thread Jacek Laskowski
Hi,

Use jps -lm and see the processes on the machine(s) to kill.
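
On a standalone worker the output might look something like this (a sketch;
PIDs and arguments will differ):

$ jps -lm
4321 org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url ...
4210 org.apache.spark.deploy.worker.Worker spark://your-master:7077
$ kill 4321    # kills just that executor; the worker and the driver stay up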

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Wed, Jul 6, 2016 at 9:49 PM, Mr rty ff  wrote:
> Hi
> I'd like to recreate this bug:
> https://issues.apache.org/jira/browse/SPARK-13979
> They are talking about stopping Spark executors. It's not clear exactly how I
> stop the executors.
> Thanks

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Reynold Xin
When using native data sources (e.g. Parquet, ORC, JSON, ...), partitions
are automatically merged so they would add up to a specific size,
configurable by spark.sql.files.maxPartitionBytes.

spark.sql.files.openCostInBytes is used to specify the cost of each "file".
That is, an empty file will be considered to have at
least spark.sql.files.openCostInBytes bytes.
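
A minimal Scala sketch of tuning these (the values shown are the defaults as I
understand them, for illustration rather than as recommendations):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("small-files-packing")
  // target size of a read partition for file-based sources
  .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
  // assumed cost of opening one file; raising it packs more small files
  // into each read partition
  .config("spark.sql.files.openCostInBytes", 4L * 1024 * 1024)
  .getOrCreate()

// many small parquet files get coalesced into fewer read partitions
val df = spark.read.parquet("hdfs://xxx/path/with/small/files")
println(df.rdd.getNumPartitions)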

On Wed, Jul 6, 2016 at 11:53 PM, Ajay Srivastava <
a_k_srivast...@yahoo.com.invalid> wrote:

> Hi,
>
> This jira https://issues.apache.org/jira/browse/SPARK-8813 is fixed in
> spark 2.0.
> But resolution is not mentioned there.
>
> In our use case, there are big as well as many small parquet files which
> are being queried using spark sql.
> Can someone please explain what is the fix and how I can use it in spark
> 2.0 ? I did search commits done in 2.0 branch and looks like I need to
> use spark.sql.files.openCostInBytes but I am not sure.
>
>
> Regards,
> Ajay
>


Re: [DISCUSS] Minimize use of MINOR, BUILD, and HOTFIX w/ no JIRA

2016-07-07 Thread Tom Graves
Popping this back up to the dev list again. I see a bunch of checkins with
minor or hotfix.

It seems to me we shouldn't be doing this, but I would like to hear thoughts
from others. I see no reason we can't have a jira for each of those issues; it
only takes a few seconds to file one and it makes things much easier to track.

For instance, I tend to watch the jiras on the mailing list, and if I hit an
issue I search jira to see if there is an existing one for it. But if there
isn't a jira, I can't find what someone perhaps already fixed with a
[MINOR] checkin.

Tom

On Saturday, June 6, 2015 11:02 AM, Patrick Wendell  
wrote:
 

 Hey All,

Just a request here - it would be great if people could create JIRA's
for any and all merged pull requests. The reason is that when patches
get reverted due to build breaks or other issues, it is very difficult
to keep track of what is going on if there is no JIRA. Here is a list
of 5 patches we had to revert recently that didn't include a JIRA:

    Revert "[MINOR] [BUILD] Use custom temp directory during build."
    Revert "[SQL] [TEST] [MINOR] Uses a temporary log4j.properties in
HiveThriftServer2Test to ensure expected logging behavior"
    Revert "[BUILD] Always run SQL tests in master build."
    Revert "[MINOR] [CORE] Warn users who try to cache RDDs with
dynamic allocation on."
    Revert "[HOT FIX] [YARN] Check whether `/lib` exists before
listing its files"

The cost overhead of creating a JIRA relative to other aspects of
development is very small. If it's *really* a documentation change or
something small, that's okay.

But anything affecting the build, packaging, etc. These all need to
have a JIRA to ensure that follow-up can be well communicated to all
Spark developers.

Hopefully this is something everyone can get behind, but opened a
discussion here in case others feel differently.

- Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



  

Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Marcelo Vanzin
(Actually that's "spark" and not "spark2", so yeah, that doesn't
really answer the question.)

On Thu, Jul 7, 2016 at 11:38 AM, Marcelo Vanzin  wrote:
> My guess would be https://github.com/pwendell/hive/tree/release-1.2.1-spark
>
> On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang  wrote:
>> I saw that the pom file has the hive version as
>> 1.2.1.spark2, but I cannot find the branch in
>> https://github.com/pwendell/
>>
>> Does anyone know where the repo is?
>>
>> Thanks.
>>
>> Zhan Zhang
>>
>>
>>
>>
>> --
>> View this message in context: 
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
>> Sent from the Apache Spark Developers List mailing list archive at 
>> Nabble.com.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>
>
>
> --
> Marcelo



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Marcelo Vanzin
My guess would be https://github.com/pwendell/hive/tree/release-1.2.1-spark

On Thu, Jul 7, 2016 at 11:37 AM, Zhan Zhang  wrote:
> I saw that the pom file has the hive version as
> 1.2.1.spark2, but I cannot find the branch in
> https://github.com/pwendell/
>
> Does anyone know where the repo is?
>
> Thanks.
>
> Zhan Zhang
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Anyone knows the hive repo for spark-2.0?

2016-07-07 Thread Zhan Zhang
I saw that the pom file has the hive version as
1.2.1.spark2, but I cannot find the branch in
https://github.com/pwendell/

Does anyone know where the repo is?

Thanks.

Zhan Zhang




--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Anyone-knows-the-hive-repo-for-spark-2-0-tp18234.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Why the org.apache.spark.sql.catalyst.expressions.SortArray is with CodegenFallback?

2016-07-07 Thread 楊閔富
I found that CollapseCodegenStages.supportCodegen(e: Expression) determines that
the SortArray expression is not codegen-supported, since SortArray is a
CodegenFallback expression. Can I ask why SortArray does not support codegen?


Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them
using pipes, sending the user's code and the data to be processed.

I am trying to understand the implementation of how this data transfer
happens using pipes.
Can anyone please guide me along those lines?

Thanks,
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui"  wrote:

> You can read
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> For pySpark data flow on worker nodes, you can read the source code of
> PythonRDD.scala. Python worker processes communicate with Spark executors
> via sockets instead of pipes.
>
> On Jul 7, 2016, at 15:49, Amit Rana  wrote:
>
> Hi all,
>
> I am trying to trace the data flow in pyspark. I am using IntelliJ IDEA
> in Windows 7.
> I submitted a Python job as follows:
> --master local[4]  
>
> I have made the following  insights after running the above command in
> debug mode:
> ->Locally when a pyspark's interpreter starts, it also starts a JVM with
> which it communicates through socket.
> ->py4j is used to handle this communication
> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
> which communicates with the spark executors in cluster.
>
> In cluster I have read that the data flow between spark executors and
> python interpreter happens using pipes. But I am not able to trace that
> data flow.
>
> Please correct me if my understanding is wrong. It would be very helpful
> if someone can help me understand the code flow for data transfer between
> the JVM and Python workers.
>
> Thanks,
> Amit Rana
>
>
>


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Niranda Perera
Hi Mark,

I agree. :-) We already have a product released with Spark 1.4.1 with some
custom extensions and now we are doing a patch release. We will update
Spark to the latest 2.x version in the next release.

Best

On Thu, Jul 7, 2016 at 1:12 PM, Mark Hamstra 
wrote:

> You've got to satisfy my curiosity, though.  Why would you want to run
> such a badly out-of-date version in production?  I mean, 2.0.0 is just
> about ready for release, and lagging three full releases behind, with one
> of them being a major version release, is a long way from where Spark is
> now.
>
> On Wed, Jul 6, 2016 at 11:12 PM, Niranda Perera 
> wrote:
>
>> Thanks Reynold
>>
>> On Thu, Jul 7, 2016 at 11:40 AM, Reynold Xin  wrote:
>>
>>> Yes definitely.
>>>
>>>
>>> On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera <
>>> niranda.per...@gmail.com> wrote:
>>>
 Thanks Reynold for the prompt response. Do you think we could use a
 1.4-branch latest build in a production environment?



 On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin 
 wrote:

> I think last time I tried I had some trouble releasing it because the
> release scripts no longer work with branch-1.4. You can build from the
> branch yourself, but it might be better to upgrade to the later versions.
>
> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera <
> niranda.per...@gmail.com> wrote:
>
>> Hi guys,
>>
>> May I know if you have halted development in the Spark 1.4 branch? I
>> see that there is a release tag for 1.4.2 but it was never released.
>>
>> Can we expect a 1.4.x bug fixing release anytime soon?
>>
>> Best
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


 --
 Niranda
 @n1r44 
 +94-71-554-8430
 https://pythagoreanscript.wordpress.com/

>>>
>>>
>>
>>
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Understanding pyspark data flow on worker nodes

2016-07-07 Thread Sun Rui
You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals 

For pySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors via 
sockets instead of pipes.
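
To make the shape of this concrete, here is a toy Scala sketch of socket-based
JVM-to-worker communication -- NOT Spark's actual code (see PythonRDD.scala and
PythonWorkerFactory for the real thing); "worker.py" is a hypothetical script:

import java.io.{DataInputStream, DataOutputStream}
import java.net.ServerSocket

// the JVM side listens on an ephemeral local port
val server = new ServerSocket(0)
// fork the worker and tell it which port to connect back to
val worker = new ProcessBuilder(
  "python", "worker.py", server.getLocalPort.toString).start()

val sock = server.accept()                      // the worker connects back
val out  = new DataOutputStream(sock.getOutputStream)
val in   = new DataInputStream(sock.getInputStream)

out.writeInt(42)                                // stream a datum to the worker
out.flush()
println(in.readInt())                           // read the processed result back

sock.close(); worker.destroy(); server.close()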

> On Jul 7, 2016, at 15:49, Amit Rana  wrote:
> 
> Hi all,
> 
> I am trying to trace the data flow in pyspark. I am using IntelliJ IDEA in 
> Windows 7.
> I submitted a Python job as follows:
> --master local[4]  
> 
> I have made the following  insights after running the above command in debug 
> mode:
> ->Locally when a pyspark's interpreter starts, it also starts a JVM with 
> which it communicates through socket.
> ->py4j is used to handle this communication 
> ->Now this JVM acts as actual spark driver, and loads a JavaSparkContext 
> which communicates with the spark executors in cluster.
> 
> In cluster I have read that the data flow between spark executors and python 
> interpreter happens using pipes. But I am not able to trace that data flow.
> 
> Please correct me if my understanding is wrong. It would be very helpful if 
> someone can help me understand the code flow for data transfer between the 
> JVM and Python workers.
> 
> Thanks,
> Amit Rana
> 



Understanding pyspark data flow on worker nodes

2016-07-07 Thread Amit Rana
Hi all,

I am trying to trace the data flow in pyspark. I am using IntelliJ IDEA in
Windows 7.
I submitted a Python job as follows:
--master local[4]  

I have made the following  insights after running the above command in
debug mode:
->Locally when a pyspark's interpreter starts, it also starts a JVM with
which it communicates through socket.
->py4j is used to handle this communication
->Now this JVM acts as actual spark driver, and loads a JavaSparkContext
which communicates with the spark executors in cluster.

In cluster I have read that the data flow between spark executors and
python interpreter happens using pipes. But I am not able to trace that
data flow.

Please correct me if my understanding is wrong. It would be very helpful
if someone can help me understand the code flow for data transfer between
the JVM and Python workers.

Thanks,
Amit Rana


Re: SPARK-8813 - combining small files in spark sql

2016-07-07 Thread Sean Owen
-user

Reynold made the comment that he thinks this was resolved by another
change; maybe he can comment.

On Thu, Jul 7, 2016 at 7:53 AM, Ajay Srivastava
 wrote:
> Hi,
>
> This jira https://issues.apache.org/jira/browse/SPARK-8813 is fixed in spark
> 2.0.
> But resolution is not mentioned there.
>
> In our use case, there are big as well as many small parquet files which are
> being queried using spark sql.
> Can someone please explain what is the fix and how I can use it in spark 2.0
> ? I did search commits done in 2.0 branch and looks like I need to use
> spark.sql.files.openCostInBytes but I am not sure.
>
>
> Regards,
> Ajay

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Mark Hamstra
You've got to satisfy my curiosity, though.  Why would you want to run such
a badly out-of-date version in production?  I mean, 2.0.0 is just about
ready for release, and lagging three full releases behind, with one of them
being a major version release, is a long way from where Spark is now.

On Wed, Jul 6, 2016 at 11:12 PM, Niranda Perera 
wrote:

> Thanks Reynold
>
> On Thu, Jul 7, 2016 at 11:40 AM, Reynold Xin  wrote:
>
>> Yes definitely.
>>
>>
>> On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera  wrote:
>>
>>> Thanks Reynold for the prompt response. Do you think we could use a
>>> 1.4-branch latest build in a production environment?
>>>
>>>
>>>
>>> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin 
>>> wrote:
>>>
 I think last time I tried I had some trouble releasing it because the
 release scripts no longer work with branch-1.4. You can build from the
 branch yourself, but it might be better to upgrade to the later versions.

 On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera <
 niranda.per...@gmail.com> wrote:

> Hi guys,
>
> May I know if you have halted development in the Spark 1.4 branch? I
> see that there is a release tag for 1.4.2 but it was never released.
>
> Can we expect a 1.4.x bug fixing release anytime soon?
>
> Best
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


>>>
>>>
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


SparkSQL Added file get Exception: is a directory and recursive is not turned on

2016-07-07 Thread linxi zeng
Hi, all:
   As recorded in https://issues.apache.org/jira/browse/SPARK-16408, when
using spark-sql to execute SQL like:
   add file hdfs://xxx/user/test;
   If the HDFS path (hdfs://xxx/user/test) is a directory, then we will get
an exception like:

org.apache.spark.SparkException: Added file hdfs://xxx/user/test is a
directory and recursive is not turned on.
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1372)
   at org.apache.spark.SparkContext.addFile(SparkContext.scala:1340)
   at
org.apache.spark.sql.hive.execution.AddFile.run(commands.scala:117)
   at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
   at
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
   at
org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)


   I think we should add a parameter (spark.input.dir.recursive) to
control the value of recursive, and make this parameter work by modifying
some code, like:

diff --git
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
index 6b16d59..3be8553 100644
---
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
+++
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/commands.scala
@@ -113,8 +113,9 @@ case class AddFile(path: String) extends RunnableCommand {
 
   override def run(sqlContext: SQLContext): Seq[Row] = {
     val hiveContext = sqlContext.asInstanceOf[HiveContext]
+    val recursive = sqlContext.sparkContext.getConf.getBoolean("spark.input.dir.recursive", false)
     hiveContext.runSqlHive(s"ADD FILE $path")
-    hiveContext.sparkContext.addFile(path)
+    hiveContext.sparkContext.addFile(path, recursive)
     Seq.empty[Row]
   }
 }
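
With a change along those lines, the behavior could be toggled per job -- a
sketch, where spark.input.dir.recursive is the property proposed above, not an
existing Spark conf:

./bin/spark-sql --conf spark.input.dir.recursive=true -e "add file hdfs://xxx/user/test;"

For comparison, SparkContext.addFile(path, recursive = true) already supports
recursive adds; the change above just exposes that through the SQL ADD FILE
path.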


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Niranda Perera
Thanks Reynold

On Thu, Jul 7, 2016 at 11:40 AM, Reynold Xin  wrote:

> Yes definitely.
>
>
> On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera 
> wrote:
>
>> Thanks Reynold for the prompt response. Do you think we could use a
>> 1.4-branch latest build in a production environment?
>>
>>
>>
>> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:
>>
>>> I think last time I tried I had some trouble releasing it because the
>>> release scripts no longer work with branch-1.4. You can build from the
>>> branch yourself, but it might be better to upgrade to the later versions.
>>>
>>> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera <
>>> niranda.per...@gmail.com> wrote:
>>>
 Hi guys,

 May I know if you have halted development in the Spark 1.4 branch? I
 see that there is a release tag for 1.4.2 but it was never released.

 Can we expect a 1.4.x bug fixing release anytime soon?

 Best
 --
 Niranda
 @n1r44 
 +94-71-554-8430
 https://pythagoreanscript.wordpress.com/

>>>
>>>
>>
>>
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Reynold Xin
Yes definitely.


On Wed, Jul 6, 2016 at 11:08 PM, Niranda Perera 
wrote:

> Thanks Reynold for the prompt response. Do you think we could use a
> 1.4-branch latest build in a production environment?
>
>
>
> On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:
>
>> I think last time I tried I had some trouble releasing it because the
>> release scripts no longer work with branch-1.4. You can build from the
>> branch yourself, but it might be better to upgrade to the later versions.
>>
>> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera  wrote:
>>
>>> Hi guys,
>>>
>>> May I know if you have halted development in the Spark 1.4 branch? I see
>>> that there is a release tag for 1.4.2 but it was never released.
>>>
>>> Can we expect a 1.4.x bug fixing release anytime soon?
>>>
>>> Best
>>> --
>>> Niranda
>>> @n1r44 
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Niranda Perera
Thanks Reynold for the prompt response. Do you think we could use a
1.4-branch latest build in a production environment?



On Thu, Jul 7, 2016 at 11:33 AM, Reynold Xin  wrote:

> I think last time I tried I had some trouble releasing it because the
> release scripts no longer work with branch-1.4. You can build from the
> branch yourself, but it might be better to upgrade to the later versions.
>
> On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera 
> wrote:
>
>> Hi guys,
>>
>> May I know if you have halted development in the Spark 1.4 branch? I see
>> that there is a release tag for 1.4.2 but it was never released.
>>
>> Can we expect a 1.4.x bug fixing release anytime soon?
>>
>> Best
>> --
>> Niranda
>> @n1r44 
>> +94-71-554-8430
>> https://pythagoreanscript.wordpress.com/
>>
>
>


-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/


Re: Latest spark release in the 1.4 branch

2016-07-07 Thread Reynold Xin
I think last time I tried I had some trouble releasing it because the
release scripts no longer work with branch-1.4. You can build from the
branch yourself, but it might be better to upgrade to the later versions.

On Wed, Jul 6, 2016 at 11:02 PM, Niranda Perera 
wrote:

> Hi guys,
>
> May I know if you have halted development in the Spark 1.4 branch? I see
> that there is a release tag for 1.4.2 but it was never released.
>
> Can we expect a 1.4.x bug fixing release anytime soon?
>
> Best
> --
> Niranda
> @n1r44 
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>


Latest spark release in the 1.4 branch

2016-07-07 Thread Niranda Perera
Hi guys,

May I know if you have halted development in the Spark 1.4 branch? I see
that there is a release tag for 1.4.2 but it was never released.

Can we expect a 1.4.x bug fixing release anytime soon?

Best
-- 
Niranda
@n1r44 
+94-71-554-8430
https://pythagoreanscript.wordpress.com/