Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
Hi Alex,

From my understanding, the community is shifting its effort from RDD-based
APIs to Dataset/DataFrame-based ones, so as I mentioned before, I don't think
it is necessary to add a new RDD-based API. As for the problem of too many
partitions, I think there are many other solutions to handle it.

Of course it is just my own thought.

Thanks
Saisai

On Fri, May 20, 2016 at 1:15 PM, Alexander Pivovarov 
wrote:

> Saisai, Reynold,
>
> Thank you for your replies.
> I also think that having many variations of the textFile() method might be
> confusing for users. It is better to have just one good textFile() implementation.
>
> Do you think sc.textFile() should use CombineTextInputFormat instead
> of TextInputFormat?
>
> CombineTextInputFormat allows users to control the number of partitions in the
> RDD (by controlling the split size).
> It's useful for real workloads (e.g. 100 folders, 200,000 files of different
> sizes ranging from 100KB to 500MB, 4TB in total).
>
> If we use the current implementation of sc.textFile(), it will generate an RDD
> with 250,000+ partitions (one partition for each small file, several
> partitions for each big file).
>
> Using CombineTextInputFormat allows us to control the number of partitions and
> the split size by setting the mapreduce.input.fileinputformat.split.maxsize
> property. E.g. if we set it to 256MB, Spark will generate an RDD with ~20,000
> partitions.
>
> It's better to have an RDD with 20,000 partitions of 256MB each than an RDD with
> 250,000+ partitions of sizes ranging from 100KB to 128MB.
>
> So, I see only advantages if sc.textFile() starts using CombineTextInputFormat
> instead of TextInputFormat.
>
> Alex
>
> On Thu, May 19, 2016 at 8:30 PM, Saisai Shao 
> wrote:
>
>> From my understanding I think newAPIHadoopFile or hadoopFile is generic
>> enough for you to support any InputFormat you wanted. IMO it is not so
>> necessary to add a new API for this.
>>
>> On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Spark users might not know about CombineTextInputFormat. They probably
>>> think that sc.textFile already implements the best way to read text files.
>>>
>>> I think CombineTextInputFormat can replace regular TextInputFormat in
>>> most of the cases.
>>> Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
>>> On May 19, 2016 2:43 AM, "Reynold Xin"  wrote:
>>>
 Users would be able to run this already with the 3 lines of code you
 supplied right? In general there are a lot of methods already on
 SparkContext and we lean towards the more conservative side in introducing
 new API variants.

 Note that this is something we are doing automatically in Spark SQL for
 file sources (Dataset/DataFrame).


 On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
 apivova...@gmail.com> wrote:

> Hello Everyone
>
> Do you think it would be useful to add combinedTextFile method (which
> uses CombineTextInputFormat) to SparkContext?
>
> It allows one task to read data from multiple text files and control
> number of RDD partitions by setting
> mapreduce.input.fileinputformat.split.maxsize
>
>
>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
> val conf = sc.hadoopConfiguration
> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
> classOf[LongWritable], classOf[Text], conf).
>   map(pair => pair._2.toString).setName(path)
>   }
>
>
> Alex
>


>>
>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
Saisai, Reynold,

Thank you for your replies.
I also think that having many variations of the textFile() method might be
confusing for users. It is better to have just one good textFile() implementation.

Do you think sc.textFile() should use CombineTextInputFormat instead
of TextInputFormat?

CombineTextInputFormat allows users to control the number of partitions in the RDD
(by controlling the split size).
It's useful for real workloads (e.g. 100 folders, 200,000 files of different
sizes ranging from 100KB to 500MB, 4TB in total).

If we use the current implementation of sc.textFile(), it will generate an RDD with
250,000+ partitions (one partition for each small file, several partitions
for each big file).

Using CombineTextInputFormat allows us to control the number of partitions and
the split size by setting the mapreduce.input.fileinputformat.split.maxsize
property. E.g. if we set it to 256MB, Spark will generate an RDD with ~20,000
partitions.
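
For reference, here is a minimal sketch of that setup, along the lines of the
combinedTextFile helper quoted later in this thread (the helper name readCombined
and the 256MB value are illustrative, not anything in Spark itself):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Sketch: cap combined splits at 256MB so many small files collapse into fewer partitions.
  def readCombined(sc: SparkContext, path: String): RDD[String] = {
    val conf = sc.hadoopConfiguration
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 256L * 1024 * 1024)
    sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
      classOf[LongWritable], classOf[Text], conf)
      .map(_._2.toString)
  }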

It's better to have an RDD with 20,000 partitions of 256MB each than an RDD with
250,000+ partitions of sizes ranging from 100KB to 128MB.

So, I see only advantages if sc.textFile() starts using CombineTextInputFormat
instead of TextInputFormat.

Alex

On Thu, May 19, 2016 at 8:30 PM, Saisai Shao  wrote:

> From my understanding I think newAPIHadoopFile or hadoopFile is generic
> enough for you to support any InputFormat you wanted. IMO it is not so
> necessary to add a new API for this.
>
> On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
>> Spark users might not know about CombineTextInputFormat. They probably
>> think that sc.textFile already implements the best way to read text files.
>>
>> I think CombineTextInputFormat can replace regular TextInputFormat in
>> most of the cases.
>> Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
>> On May 19, 2016 2:43 AM, "Reynold Xin"  wrote:
>>
>>> Users would be able to run this already with the 3 lines of code you
>>> supplied right? In general there are a lot of methods already on
>>> SparkContext and we lean towards the more conservative side in introducing
>>> new API variants.
>>>
>>> Note that this is something we are doing automatically in Spark SQL for
>>> file sources (Dataset/DataFrame).
>>>
>>>
>>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>>> apivova...@gmail.com> wrote:
>>>
 Hello Everyone

 Do you think it would be useful to add combinedTextFile method (which
 uses CombineTextInputFormat) to SparkContext?

 It allows one task to read data from multiple text files and control
 number of RDD partitions by setting
 mapreduce.input.fileinputformat.split.maxsize


   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
 val conf = sc.hadoopConfiguration
 sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
 classOf[LongWritable], classOf[Text], conf).
   map(pair => pair._2.toString).setName(path)
   }


 Alex

>>>
>>>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
Changed my vote to +1. Thanks!

2016-05-19 13:28 GMT-07:00 Xiao Li :

> Will do. Thanks!
>
> 2016-05-19 13:26 GMT-07:00 Reynold Xin :
>
>> Xiao thanks for posting. Please file a bug in JIRA. Again as I said in
>> the email this is not meant to be a functional release and will contain
>> bugs.
>>
>> On Thu, May 19, 2016 at 1:20 PM, Xiao Li  wrote:
>>
>>> -1
>>>
>>> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext
>>> and SparkSession. Both failed. It always uses in-memory catalog. Anybody
>>> else hit the same issue?
>>>
>>>
>>> Method 1: SparkSession
>>>
>>> >>> from pyspark.sql import SparkSession
>>>
>>> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>>>
>>> >>>
>>>
>>> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>>>
>>> DataFrame[]
>>>
>>> >>> spark.sql("LOAD DATA LOCAL INPATH
>>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "<stdin>", line 1, in <module>
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
>>> line 494, in sql
>>>
>>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>>> line 933, in __call__
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
>>> line 57, in deco
>>>
>>> return f(*a, **kw)
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>>> line 312, in get_return_value
>>>
>>> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>>>
>>> : java.lang.UnsupportedOperationException: loadTable is not implemented
>>>
>>> at
>>> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>>>
>>> at
>>> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>>>
>>> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>>>
>>> at
>>> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>>>
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>>>
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>>>
>>> at
>>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>>>
>>> at
>>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>>
>>> at
>>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>>>
>>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>>>
>>> at
>>> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>>>
>>> at
>>> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>>>
>>> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
>>>
>>> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>>>
>>> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>>>
>>> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
>>>
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>
>>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>>>
>>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>>
>>> at py4j.Gateway.invoke(Gateway.java:280)
>>>
>>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>>>
>>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>>
>>> at py4j.GatewayConnection.run(GatewayConnection.java:211)
>>>
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> Method 2: Using HiveContext:
>>>
>>> >>> from pyspark.sql import HiveContext
>>>
>>> >>> sqlContext = HiveContext(sc)
>>>
>>> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
>>> STRING)")
>>>
>>> DataFrame[]
>>>
>>> >>> sqlContext.sql("LOAD DATA LOCAL INPATH
>>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>>
>>> Traceback (most recent call last):
>>>
>>>   File "<stdin>", line 1, in <module>
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py",
>>> line 346, in sql
>>>
>>> return self.sparkSession.sql(sqlQuery)
>>>
>>>   File
>>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
>>> line 494, in sql
>>>
>>> 

Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Kai,
You can simply ignore this test failure before it is fixed
> On May 20, 2016, at 12:54, Sun Rui  wrote:
> 
> Yes. I also met this issue. It is likely related to recent R versions.
> Could you help to submit a JIRA issue? I will take a look at it
>> On May 20, 2016, at 11:13, Kai Jiang wrote:
>> 
>> I was trying to build SparkR this week, but I encountered a problem with
>> SparkR unit testing that is probably similar to what Gayathri encountered.
>> I tried many times to run the ./R/run-tests.sh script, and it seems like the
>> tests fail every time.
>> 
>> Here are some environments when I was building:
>> java 7
>> R 3.3.0 (sudo apt-get install r-base-dev under Ubuntu 15.04)
>> set SPARK_HOME=/path
>> 
>> R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
>> build with:   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 
>> -Psparkr -DskipTests -T 1C clean package
>> 
>> ./R/install-dev.sh
>> ./R/run-tests.sh
>> 
>> Here is the error message I got: 
>> https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2 
>> 
>> 
>> I guess this issue is related to permissions. It seems that when I used `sudo
>> ./R/run-tests.sh` it worked sometimes. Without root permissions, maybe we
>> couldn't access the /tmp directory. However, the SparkR unit testing is brittle.
>> 
>> Could someone give any hints of how to solve this?
>> 
>> Best,
>> Kai.
>> 
>> On Thu, May 19, 2016 at 6:59 PM, Sun Rui wrote:
>> You must specify -Psparkr when building from source.
>> 
>>> On May 20, 2016, at 08:09, Gayathri Murali wrote:
>>> 
>>> That helped! Thanks. I am building from source code and I am not sure what 
>>> caused the issue with SparkR.
>>> 
>>> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng wrote:
>>> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the 
>>> latest branch-2.0, there could be an issue with your SparkR installation. 
>>> Did you try `R/install-dev.sh`?
>>> 
>>> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali wrote:
>>> This is on Spark 2.0. I see the following on the unit-tests.log when I run 
>>> the R/run-tests.sh. This on a single MAC laptop, on the recently rebased 
>>> master. R version is 3.3.0.
>>> 
>>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: 
>>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>>> 1384595 org.apache.spark.SparkException: R computation failed with
>>> 1384596
>>> 1384597 Execution halted
>>> 1384598
>>> 1384599 Execution halted
>>> 1384600
>>> 1384601 Execution halted
>>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>>> 1384604 at 
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>> 1384606 at 
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> 1384608 at 
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> 1384609 at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 1384610 at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 1384611 at java.lang.Thread.run(Thread.java:745)
>>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped 
>>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped 
>>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI 
>>> at http://localhost:4040 
>>> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR Executor: 
>>> Exception in task 1.0 in stage 5186.0 (TID 10371)
>>> 1384616 org.apache.spark.SparkException: R computation failed with
>>> 1384617
>>> 1384618 Execution halted
>>> 1384619
>>> 1384620 Execution halted
>>> 1384621
>>> 1384622 Execution halted
>>> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>>> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>>> 1384625 at 
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
>>> t 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Hyukjin Kwon
I happened to test SparkR on Windows 7 (32-bit) and it seems some tests
are failing.

Could this be a reason to downvote?

For more details of the tests, please see
https://github.com/apache/spark/pull/13165#issuecomment-220515182



2016-05-20 13:44 GMT+09:00 Takuya UESHIN :

> Hi all,
>
> I'm sorry, I misunderstood the purpose of this vote.
>
> I change to +1.
>
> Thanks.
>
>
>
>
> 2016-05-20 12:05 GMT+09:00 Takuya UESHIN :
>
>> -1 (non-binding)
>>
>> I filed 2 major bugs of Spark SQL:
>>
>> SPARK-15308: RowEncoder should preserve nested column name.
>> SPARK-15313: EmbedSerializerInFilter rule should keep exprIds of output of
>> surrounded SerializeFromObject.
>>
>> I've sent PRs for those, please check them.
>>
>> Thanks.
>>
>>
>>
>>
>> 2016-05-18 14:40 GMT+09:00 Reynold Xin :
>>
>>> Hi,
>>>
>>> In the past the Apache Spark community has created preview packages
>>> (not official releases) and used those as opportunities to ask community
>>> members to test the upcoming versions of Apache Spark. Several people in
>>> the Apache community have suggested we conduct votes for these preview
>>> packages and turn them into formal releases by the Apache foundation's
>>> standard. Preview releases are not meant to be functional, i.e. they can
>>> and highly likely will contain critical bugs or documentation errors, but
>>> we will be able to post them to the project's website to get wider
>>> feedback. They should satisfy the legal requirements of Apache's release
>>> policy (http://www.apache.org/dev/release.html) such as having proper
>>> licenses.
>>>
>>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 2.0.0-preview. The vote is open until Friday, May 20, 2016 at 11:00 PM PDT
>>> and passes if a majority of at least 3 +1 PMC votes are cast.
>>>
>>> [ ] +1 Release this package as Apache Spark 2.0.0-preview
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is 2.0.0-preview
>>> (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>>>
>>> Release artifacts are signed with the following key:
>>> https://people.apache.org/keys/committer/pwendell.asc
>>>
>>> The documentation corresponding to this release can be found at:
>>> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>>>
>>> The list of resolved issues is:
>>> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>>>
>>>
>>> If you are a Spark user, you can help us test this release by taking an
>>> existing Apache Spark workload and running on this candidate, then
>>> reporting any regressions.
>>>
>>>
>>
>>
>> --
>> Takuya UESHIN
>> Tokyo, Japan
>>
>> http://twitter.com/ueshin
>>
>
>
>
> --
> Takuya UESHIN
> Tokyo, Japan
>
> http://twitter.com/ueshin
>


Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
Yes. I also met this issue. It is likely related to recent R versions.
Could you help to submit a JIRA issue? I will take a look at it
> On May 20, 2016, at 11:13, Kai Jiang  wrote:
> 
> I was trying to build SparkR this week, but I encountered a problem with
> SparkR unit testing that is probably similar to what Gayathri encountered.
> I tried many times to run the ./R/run-tests.sh script, and it seems like the
> tests fail every time.
> 
> Here are some environments when I was building:
> java 7
> R 3.3.0 (sudo apt-get install r-base-dev under Ubuntu 15.04)
> set SPARK_HOME=/path
> 
> R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
> build with:   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 
> -Psparkr -DskipTests -T 1C clean package
> 
> ./R/install-dev.sh
> ./R/run-tests.sh
> 
> Here is the error message I got: 
> https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2 
> 
> 
> I guess this issue is related to permissions. It seems that when I used `sudo
> ./R/run-tests.sh` it worked sometimes. Without root permissions, maybe we
> couldn't access the /tmp directory. However, the SparkR unit testing is brittle.
> 
> Could someone give any hints of how to solve this?
> 
> Best,
> Kai.
> 
> On Thu, May 19, 2016 at 6:59 PM, Sun Rui wrote:
> You must specify -Psparkr when building from source.
> 
>> On May 20, 2016, at 08:09, Gayathri Murali wrote:
>> 
>> That helped! Thanks. I am building from source code and I am not sure what 
>> caused the issue with SparkR.
>> 
>> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng wrote:
>> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the 
>> latest branch-2.0, there could be an issue with your SparkR installation. 
>> Did you try `R/install-dev.sh`?
>> 
>> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali wrote:
>> This is on Spark 2.0. I see the following on the unit-tests.log when I run 
>> the R/run-tests.sh. This on a single MAC laptop, on the recently rebased 
>> master. R version is 3.3.0.
>> 
>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: 
>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>> 1384595 org.apache.spark.SparkException: R computation failed with
>> 1384596
>> 1384597 Execution halted
>> 1384598
>> 1384599 Execution halted
>> 1384600
>> 1384601 Execution halted
>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>> 1384604 at 
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> 1384606 at 
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> 1384608 at 
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 1384609 at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 1384610 at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 1384611 at java.lang.Thread.run(Thread.java:745)
>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped 
>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped 
>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI at 
>> http://localhost:4040 
>> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR Executor: 
>> Exception in task 1.0 in stage 5186.0 (TID 10371)
>> 1384616 org.apache.spark.SparkException: R computation failed with
>> 1384617
>> 1384618 Execution halted
>> 1384619
>> 1384620 Execution halted
>> 1384621
>> 1384622 Execution halted
>> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>> 1384625 at 
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
>> t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 1384630 at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 1384631 at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 1384632 at java.lang.Thread.run(Thread.java:745)
>> 

Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Saisai Shao
From my understanding I think newAPIHadoopFile or hadoopFile is generic
enough for you to support any InputFormat you want. IMO it is not so
necessary to add a new API for this.

On Fri, May 20, 2016 at 12:59 AM, Alexander Pivovarov 
wrote:

> Spark users might not know about CombineTextInputFormat. They probably
> think that sc.textFile already implements the best way to read text files.
>
> I think CombineTextInputFormat can replace regular TextInputFormat in most
> of the cases.
> Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
> On May 19, 2016 2:43 AM, "Reynold Xin"  wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>>
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>> val conf = sc.hadoopConfiguration
>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>> classOf[LongWritable], classOf[Text], conf).
>>>   map(pair => pair._2.toString).setName(path)
>>>   }
>>>
>>>
>>> Alex
>>>
>>
>>


Re: SparkR dataframe error

2016-05-19 Thread Kai Jiang
I was trying to build SparkR this week, but I encountered a problem with
SparkR unit testing that is probably similar to what Gayathri encountered.
I tried many times to run the ./R/run-tests.sh script, and it seems like the
tests fail every time.

Here are some environments when I was building:
java 7
R 3.3.0 (sudo apt-get install r-base-dev under Ubuntu 15.04)
set SPARK_HOME=/path

R -e 'install.packages("testthat", repos="http://cran.us.r-project.org")'
build with:   build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0
-Psparkr -DskipTests -T 1C clean package

./R/install-dev.sh
./R/run-tests.sh

Here is the error message I got:
https://gist.github.com/vectorijk/71f4ff34e3d34a628b8a3013f0ca2aa2

I guess this issue is related to permissions. It seems that when I used `sudo
./R/run-tests.sh` it worked sometimes. Without root permissions, maybe we
couldn't access the /tmp directory. However, the SparkR unit testing is
brittle.

Could someone give any hints of how to solve this?

Best,
Kai.

On Thu, May 19, 2016 at 6:59 PM, Sun Rui  wrote:

> You must specify -Psparkr when building from source.
>
> On May 20, 2016, at 08:09, Gayathri Murali 
> wrote:
>
> That helped! Thanks. I am building from source code and I am not sure what
> caused the issue with SparkR.
>
> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng  wrote:
>
>> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing
>> the latest branch-2.0, there could be an issue with your SparkR
>> installation. Did you try `R/install-dev.sh`?
>>
>> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali <
>> gayathri.m.sof...@gmail.com> wrote:
>>
>>> This is on Spark 2.0. I see the following on the unit-tests.log when I
>>> run the R/run-tests.sh. This on a single MAC laptop, on the recently
>>> rebased master. R version is 3.3.0.
>>>
>>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor:
>>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>>> 1384595 org.apache.spark.SparkException: R computation failed with
>>> 1384596
>>> 1384597 Execution halted
>>> 1384598
>>> 1384599 Execution halted
>>> 1384600
>>> 1384601 Execution halted
>>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>>> 1384604 at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>> 1384606 at
>>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>>> 1384608 at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> 1384609 at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 1384610 at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 1384611 at java.lang.Thread.run(Thread.java:745)
>>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped
>>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped
>>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web
>>> UI at http://localhost:4040
>>> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR
>>> Executor: Exception in task 1.0 in stage 5186.0 (TID 10371)
>>> 1384616 org.apache.spark.SparkException: R computation failed with
>>> 1384617
>>> 1384618 Execution halted
>>> 1384619
>>> 1384620 Execution halted
>>> 1384621
>>> 1384622 Execution halted
>>> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>>> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>>> 1384625 at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>>> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>>> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
>>> t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>>> 1384630 at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> 1384631 at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> 1384632 at java.lang.Thread.run(Thread.java:745)
>>> 1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler:
>>> Job 5183 failed: collect at null:-1, took 0.211674 s
>>> 1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR
>>> RBackendHandler: collect on 26345 failed
>>> 1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage
>>> 5186 (collect at null:-1) failed in 0.210 s
>>> 1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus:
>>> SparkListenerBus has already stopped! Dropping event
>>> 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Takuya UESHIN
-1 (non-binding)

I filed 2 major bugs of Spark SQL:

SPARK-15308: RowEncoder should preserve nested column name.
SPARK-15313: EmbedSerializerInFilter rule should keep exprIds of output of
surrounded SerializeFromObject.

I've sent PRs for those, please check them.

Thanks.




2016-05-18 14:40 GMT+09:00 Reynold Xin :

> Hi,
>
> In the past the Apache Spark community has created preview packages (not
> official releases) and used those as opportunities to ask community members
> to test the upcoming versions of Apache Spark. Several people in the Apache
> community have suggested we conduct votes for these preview packages and
> turn them into formal releases by the Apache foundation's standard. Preview
> releases are not meant to be functional, i.e. they can and highly likely
> will contain critical bugs or documentation errors, but we will be able to
> post them to the project's website to get wider feedback. They should
> satisfy the legal requirements of Apache's release policy (
> http://www.apache.org/dev/release.html) such as having proper licenses.
>
>
> Please vote on releasing the following candidate as Apache Spark version
> 2.0.0-preview. The vote is open until Friday, May 20, 2016 at 11:00 PM PDT
> and passes if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 2.0.0-preview
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is 2.0.0-preview
> (8f5a04b6299e3a47aca13cbb40e72344c0114860)
>
> The release files, including signatures, digests, etc. can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The documentation corresponding to this release can be found at:
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/
>
> The list of resolved issues is:
> https://issues.apache.org/jira/browse/SPARK-15351?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%202.0.0
>
>
> If you are a Spark user, you can help us test this release by taking an
> existing Apache Spark workload and running on this candidate, then
> reporting any regressions.
>
>


-- 
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin


Re: SBT doesn't pick resource file after clean

2016-05-19 Thread dhruve ashar
Based on the conversation on the PR, the intent was not to pollute the source
directory, so we are placing the generated file outside it, in the
target/extra-resources directory. I agree that the "sbt way" is to add the
generated resources under the resourceManaged setting, which was essentially
the earlier approach we implemented.

However, even when generating the file under the default resourceDirectory
(core/src/resources), the file is not picked up in the jar after doing a clean.
So this seems to be a different issue.
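
For reference, a minimal sketch of the resourceGenerators approach Jakob
describes in the quoted reply below (the file name and contents here are
illustrative, not the actual Spark build code):

  // build.sbt sketch (sbt 0.13): generate a properties file under resourceManaged
  // so it is packaged into the jar and removed by `clean`.
  resourceGenerators in Compile += Def.task {
    val out = (resourceManaged in Compile).value / "extra-build-info.properties"
    val props = s"version=${version.value}\nbuiltBy=${sys.props("user.name")}"
    IO.write(out, props)
    Seq(out)  // files returned here end up on the runtime classpath
  }.taskValue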





On Thu, May 19, 2016 at 4:17 PM, Jakob Odersky  wrote:

> To echo my comment on the PR: I think the "sbt way" to add extra,
> generated resources to the classpath is by adding a new task to the
> `resourceGenerators` setting. Also, the task should output any files
> into the directory specified by the `resourceManaged` setting. See
> http://www.scala-sbt.org/0.13/docs/Howto-Generating-Files.html. There
> shouldn't by any issues with clean if you follow the above
> conventions.
>
> On Tue, May 17, 2016 at 12:00 PM, Marcelo Vanzin 
> wrote:
> > Perhaps you need to make the "compile" task of the appropriate module
> > depend on the task that generates the resource file?
> >
> > Sorry but my knowledge of sbt doesn't really go too far.
> >
> > On Tue, May 17, 2016 at 11:58 AM, dhruve ashar 
> wrote:
> >> We are trying to pick the spark version automatically from pom instead
> of
> >> manually modifying the files. This also includes richer pieces of
> >> information like last commit, version, user who built the code etc to
> better
> >> identify the framework running.
> >>
> >> The setup is as follows :
> >> - A shell script generates this piece of information and dumps it into a
> >> properties file under core/target/extra-resources - we don't want to
> pollute
> >> the source directory and hence we are generating this under target as
> its
> >> dealing with build information.
> >> - The shell script is invoked in both mvn and sbt.
> >>
> >> The issue is that sbt doesn't pick up the generated properties file
> after
> >> doing a clean. But it does pick it up in subsequent runs. Note, the
> >> properties file is created before the classes are generated.
> >>
> >> The code for this is available in the PR :
> >> https://github.com/apache/spark/pull/13061
> >>
> >> Does anybody have an idea about how we can achieve this in sbt?
> >>
> >> Thanks,
> >> Dhruve
> >>
> >
> >
> >
> > --
> > Marcelo
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>



-- 
-Dhruve Ashar


Re: SparkR dataframe error

2016-05-19 Thread Sun Rui
You must specify -Psparkr when building from source.
> On May 20, 2016, at 08:09, Gayathri Murali wrote:
> 
> That helped! Thanks. I am building from source code and I am not sure what 
> caused the issue with SparkR.
> 
> On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng wrote:
> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the 
> latest branch-2.0, there could be an issue with your SparkR installation. Did 
> you try `R/install-dev.sh`?
> 
> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali wrote:
> This is on Spark 2.0. I see the following on the unit-tests.log when I run 
> the R/run-tests.sh. This on a single MAC laptop, on the recently rebased 
> master. R version is 3.3.0.
> 
> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor: Exception 
> in task 0.0 in stage 5186.0 (TID 10370)
> 1384595 org.apache.spark.SparkException: R computation failed with
> 1384596
> 1384597 Execution halted
> 1384598
> 1384599 Execution halted
> 1384600
> 1384601 Execution halted
> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384604 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384606 at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
> 1384608 at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384609 at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384610 at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384611 at java.lang.Thread.run(Thread.java:745)
> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped 
> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped 
> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI at 
> http://localhost:4040 
> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR Executor: 
> Exception in task 1.0 in stage 5186.0 (TID 10371)
> 1384616 org.apache.spark.SparkException: R computation failed with
> 1384617
> 1384618 Execution halted
> 1384619
> 1384620 Execution halted
> 1384621
> 1384622 Execution halted
> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384625 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384630 at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384631 at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384632 at java.lang.Thread.run(Thread.java:745)
> 1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler: Job 
> 5183 failed: collect at null:-1, took 0.211674 s
> 1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR RBackendHandler: 
> collect on 26345 failed
> 1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage 5186 
> (collect at null:-1) failed in 0.210 s
> 1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageIn
> fo@413da307)
> 1384637 16/05/19 11:28:13.878 Thread-1 ERROR LiveListenerBus: 
> SparkListenerBus has already stopped! Dropping event 
> SparkListenerJobEnd(5183,1463682493877,JobFailed(org.apache.sp
> ark.SparkException: Job 5183 cancelled because SparkContext was shut down))
> 1384638 16/05/19 11:28:13.880 dispatcher-event-loop-1 INFO 
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 1384639 16/05/19 11:28:13.904 Thread-1 INFO MemoryStore: MemoryStore cleared
> 1384640 16/05/19 11:28:13.904 Thread-1 INFO BlockManager: BlockManager stopped
> 1384641 16/05/19 11:28:13.904 Thread-1 INFO BlockManagerMaster: 
> BlockManagerMaster stopped
> 1384642 16/05/19 11:28:13.905 dispatcher-event-loop-0 INFO 
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully 
> stopped SparkContext
> 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown 
> hook called
> 1384645 16/05/19 11:28:13.911 Thread-1 INFO 

Re: SparkR dataframe error

2016-05-19 Thread Gayathri Murali
That helped! Thanks. I am building from source code and I am not sure what
caused the issue with SparkR.

On Thu, May 19, 2016 at 4:17 PM, Xiangrui Meng  wrote:

> We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the
> latest branch-2.0, there could be an issue with your SparkR installation.
> Did you try `R/install-dev.sh`?
>
> On Thu, May 19, 2016 at 11:42 AM Gayathri Murali <
> gayathri.m.sof...@gmail.com> wrote:
>
>> This is on Spark 2.0. I see the following on the unit-tests.log when I
>> run the R/run-tests.sh. This on a single MAC laptop, on the recently
>> rebased master. R version is 3.3.0.
>>
>> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor:
>> Exception in task 0.0 in stage 5186.0 (TID 10370)
>> 1384595 org.apache.spark.SparkException: R computation failed with
>> 1384596
>> 1384597 Execution halted
>> 1384598
>> 1384599 Execution halted
>> 1384600
>> 1384601 Execution halted
>> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>> 1384604 at
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> 1384606 at
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
>> 1384608 at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 1384609 at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 1384610 at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 1384611 at java.lang.Thread.run(Thread.java:745)
>> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped
>> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
>> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped
>> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
>> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI
>> at http://localhost:4040
>> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR
>> Executor: Exception in task 1.0 in stage 5186.0 (TID 10371)
>> 1384616 org.apache.spark.SparkException: R computation failed with
>> 1384617
>> 1384618 Execution halted
>> 1384619
>> 1384620 Execution halted
>> 1384621
>> 1384622 Execution halted
>> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
>> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
>> 1384625 at
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
>> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
>> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
>> t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>> 1384630 at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 1384631 at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> 1384632 at java.lang.Thread.run(Thread.java:745)
>> 1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler:
>> Job 5183 failed: collect at null:-1, took 0.211674 s
>> 1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR
>> RBackendHandler: collect on 26345 failed
>> 1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage
>> 5186 (collect at null:-1) failed in 0.210 s
>> 1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus:
>> SparkListenerBus has already stopped! Dropping event
>> SparkListenerStageCompleted(org.apache.spark.scheduler.StageIn
>>  fo@413da307)
>> 1384637 16/05/19 11:28:13.878 Thread-1 ERROR LiveListenerBus:
>> SparkListenerBus has already stopped! Dropping event
>> SparkListenerJobEnd(5183,1463682493877,JobFailed(org.apache.sp
>>  ark.SparkException: Job 5183 cancelled because SparkContext was shut down))
>> 1384638 16/05/19 11:28:13.880 dispatcher-event-loop-1 INFO
>> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
>> 1384639 16/05/19 11:28:13.904 Thread-1 INFO MemoryStore: MemoryStore
>> cleared
>> 1384640 16/05/19 11:28:13.904 Thread-1 INFO BlockManager: BlockManager
>> stopped
>> 1384641 16/05/19 11:28:13.904 Thread-1 INFO BlockManagerMaster:
>> BlockManagerMaster stopped
>> 1384642 16/05/19 11:28:13.905 dispatcher-event-loop-0 INFO
>> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
>> OutputCommitCoordinator stopped!
>> 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully
>> stopped SparkContext
>> 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown
>> hook called
>> 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting
>> directory
>> /private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-565915
>>1c8231
>>
>>
>> On Thu, May 19, 2016 at 8:46 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Jeff Zhang
@Xiao,

It is tracked in SPARK-15345


On Fri, May 20, 2016 at 4:20 AM, Xiao Li  wrote:

> -1
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and
> SparkSession. Both failed. It always uses in-memory catalog. Anybody else
> hit the same issue?
>
>
> Method 1: SparkSession
>
> >>> from pyspark.sql import SparkSession
>
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> >>>
>
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>
> DataFrame[]
>
> >>> spark.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
> line 494, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
> line 57, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>
> : java.lang.UnsupportedOperationException: loadTable is not implemented
>
> at
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:187)
>
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:168)
>
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:280)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> Method 2: Using HiveContext:
>
> >>> from pyspark.sql import HiveContext
>
> >>> sqlContext = HiveContext(sc)
>
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")
>
> DataFrame[]
>
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> Traceback (most recent call last):
>
>   File "<stdin>", line 1, in <module>
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py",
> line 346, in sql
>
> return self.sparkSession.sql(sqlQuery)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
> line 494, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
> line 57, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>
> : 

Re: Spark driver and yarn behavior

2016-05-19 Thread Shankar Venkataraman
Thanks Luciano. The case we are seeing is different: the YARN resource
manager is shutting down the container in which the executor is running,
since there does not seem to be a response, and it is deeming the executor dead.
It started another container, but the driver seems to be oblivious for nearly 2
hours. I am wondering if there is a condition where the driver does not see
the notification from the YARN RM about the executor container going away.
We will try some of the settings you pointed to and see if that alleviates the
issue.

Shankar

On Thu, 19 May 2016 at 16:20 Luciano Resende  wrote:

>
> On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
> shankarvenkataraman...@gmail.com> wrote:
>
>> Hi!
>>
>> We are running into an interesting behavior with the Spark driver. We run
>> Spark under YARN. The Spark driver seems to be sending work to a
>> dead executor for 3 hours before it recognizes it. The workload seems to
>> have been processed by other executors just fine and we see no loss in
>> overall through put. This Jira -
>> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
>> similar behavior.
>>
>> The yarn resource manager log indicates the following:
>>
>> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
>> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
>> Timed out after 600 secs
>> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
>> (RMNodeImpl.java:transition(746)) - Deactivating Node 
>> dn-a01.example.org:45454 as it is now LOST
>>
>> The executor is not reachable for 10 minutes according to this log
>> message, but the executor's log shows plenty of RDD processing during that
>> time frame.
>> This seems like a pretty big issue because the orphaned executor seems to
>> cause a memory leak in the driver, and the driver becomes unresponsive due
>> to heavy full GC.
>>
>> Has anyone else run into a similar situation?
>>
>> Thanks for any and all feedback / suggestions.
>>
>> Shankar
>>
>>
> I am not sure if this is exactly the same issue, but while we were doing
> heavy processing of a large history of tweet data via streaming, we were
> having similar issues due to the load on the executors, and we bumped some
> configurations to avoid losing some of these executors (even though they
> were alive, they were too busy to heartbeat or something).
>
> Some of these are described at
>
> https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
>
>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>


Re: Spark driver and yarn behavior

2016-05-19 Thread Luciano Resende
On Thu, May 19, 2016 at 3:16 PM, Shankar Venkataraman <
shankarvenkataraman...@gmail.com> wrote:

> Hi!
>
> We are running into an interesting behavior with the Spark driver. We run
> Spark under YARN. The Spark driver seems to be sending work to a
> dead executor for 3 hours before it recognizes it. The workload seems to
> have been processed by other executors just fine and we see no loss in
> overall through put. This Jira -
> https://issues.apache.org/jira/browse/SPARK-10586 - seems to indicate a
> similar behavior.
>
> The yarn resource manager log indicates the following:
>
> 2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor 
> (AbstractLivelinessMonitor.java:run(127)) - Expired:dn-a01.example.org:45454 
> Timed out after 600 secs
> 2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl 
> (RMNodeImpl.java:transition(746)) - Deactivating Node 
> dn-a01.example.org:45454 as it is now LOST
>
> The executor is not reachable for 10 minutes according to this log message,
> but the executor's log shows plenty of RDD processing during that time frame.
> This seems like a pretty big issue because the orphaned executor seems to
> cause a memory leak in the driver, and the driver becomes unresponsive due
> to heavy full GC.
>
> Has anyone else run into a similar situation?
>
> Thanks for any and all feedback / suggestions.
>
> Shankar
>
>
I am not sure if this is exactly the same issue, but while we were doing
heavy processing of a large history of tweet data via streaming, we were
having similar issues due to the load on the executors, and we bumped some
configurations to avoid losing some of these executors (even though they
were alive, they were too busy to heartbeat or something).

Some of these are described at
https://github.com/SparkTC/redrock/blob/master/twitter-decahose/src/main/scala/com/decahose/ApplicationContext.scala
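
For illustration, a hedged sketch of the kind of settings involved (the
property names are standard Spark configuration keys; the values below are
placeholders, not the ones that project used -- see the linked
ApplicationContext.scala for the actual configuration):

  import org.apache.spark.SparkConf

  // Raise network/heartbeat tolerances so busy executors are not declared lost.
  // Tune the values for your own cluster before relying on them.
  val conf = new SparkConf()
    .set("spark.network.timeout", "600s")            // default is 120s
    .set("spark.executor.heartbeatInterval", "60s")  // default is 10s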



-- 
Luciano Resende
http://twitter.com/lresende1975
http://lresende.blogspot.com/


Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
We no longer have `SparkRWrappers` in Spark 2.0. So if you are testing the
latest branch-2.0, there could be an issue with your SparkR installation.
Did you try `R/install-dev.sh`?

On Thu, May 19, 2016 at 11:42 AM Gayathri Murali <
gayathri.m.sof...@gmail.com> wrote:

> This is on Spark 2.0. I see the following on the unit-tests.log when I run
> the R/run-tests.sh. This on a single MAC laptop, on the recently rebased
> master. R version is 3.3.0.
>
> 16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor:
> Exception in task 0.0 in stage 5186.0 (TID 10370)
> 1384595 org.apache.spark.SparkException: R computation failed with
> 1384596
> 1384597 Execution halted
> 1384598
> 1384599 Execution halted
> 1384600
> 1384601 Execution halted
> 1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384604 at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384606 at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> 1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
> 1384608 at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384609 at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384610 at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384611 at java.lang.Thread.run(Thread.java:745)
> 1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped
> o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
> 1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped
> o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
> 1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI
> at http://localhost:4040
> 1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR
> Executor: Exception in task 1.0 in stage 5186.0 (TID 10371)
> 1384616 org.apache.spark.SparkException: R computation failed with
> 1384617
> 1384618 Execution halted
> 1384619
> 1384620 Execution halted
> 1384621
> 1384622 Execution halted
> 1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
> 1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
> 1384625 at
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
> 1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
> 1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
> t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> 1384630 at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 1384631 at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 1384632 at java.lang.Thread.run(Thread.java:745)
> 1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler: Job
> 5183 failed: collect at null:-1, took 0.211674 s
> 1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR RBackendHandler:
> collect on 26345 failed
> 1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage 5186
> (collect at null:-1) failed in 0.210 s
> 1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerStageCompleted(org.apache.spark.scheduler.StageIn
>  fo@413da307)
> 1384637 16/05/19 11:28:13.878 Thread-1 ERROR LiveListenerBus:
> SparkListenerBus has already stopped! Dropping event
> SparkListenerJobEnd(5183,1463682493877,JobFailed(org.apache.sp
>  ark.SparkException: Job 5183 cancelled because SparkContext was shut down))
> 1384638 16/05/19 11:28:13.880 dispatcher-event-loop-1 INFO
> MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 1384639 16/05/19 11:28:13.904 Thread-1 INFO MemoryStore: MemoryStore
> cleared
> 1384640 16/05/19 11:28:13.904 Thread-1 INFO BlockManager: BlockManager
> stopped
> 1384641 16/05/19 11:28:13.904 Thread-1 INFO BlockManagerMaster:
> BlockManagerMaster stopped
> 1384642 16/05/19 11:28:13.905 dispatcher-event-loop-0 INFO
> OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
> OutputCommitCoordinator stopped!
> 1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully
> stopped SparkContext
> 1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown
> hook called
> 1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting
> directory
> /private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-565915
>1c8231
>
>
> On Thu, May 19, 2016 at 8:46 AM, Xiangrui Meng  wrote:
>
>> Is it on 1.6.x?
>>
>> On Wed, May 18, 2016, 6:57 PM Sun Rui  wrote:
>>
>>> I saw it, but I can’t see the complete error message on it.
>>> I mean the part after “error in invokingJava(…)”
>>>
>>> On May 19, 2016, 

Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Reynold Xin
The old one is deprecated but should still work though.


On Thu, May 19, 2016 at 3:51 PM, Arun Allamsetty 
wrote:

> Hi Doug,
>
> If you look at the API docs here:
> http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.hive.HiveContext,
> you'll see "Deprecated (Since version 2.0.0): Use
> SparkSession.builder.enableHiveSupport instead".
> So you probably need to use that.
>
> Arun
>
> On Thu, May 19, 2016 at 3:44 PM, Michael Armbrust 
> wrote:
>
>> 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)”
>>> doesn’t work because “HiveContext not a member of
>>> org.apache.spark.sql.hive”  I checked the documentation, and it looks like
>>> it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz
>>>
>>
>> HiveContext has been deprecated and moved to a 1.x compatibility package,
>> which you'll need to include explicitly.  Docs have not been updated yet.
>>
>>
>>> 2. I also tried the new spark session, ‘spark.table(“db.table”)’, it
>>> fails with a HDFS permission denied can’t write to “/user/hive/warehouse”
>>>
>>
>> Where are the HDFS configurations located?  We might not be propagating
>> them correctly any more.
>>
>
>


Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Arun Allamsetty
Hi Doug,

If you look at the API docs here:
http://home.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-docs/api/scala/index.html#org.apache.spark.sql.hive.HiveContext,
you'll see "Deprecated (Since version 2.0.0): Use
SparkSession.builder.enableHiveSupport instead".
So you probably need to use that.
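
A minimal sketch of that entry point against the 2.0.0-preview API (the app
name and the database/table names below are hypothetical):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("hive-smoke-test")      // illustrative name
    .enableHiveSupport()
    .getOrCreate()
  val df = spark.table("db.table")   // hypothetical Hive table
  df.show()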

Arun

On Thu, May 19, 2016 at 3:44 PM, Michael Armbrust 
wrote:

> 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)”
>> doesn’t work because “HiveContext not a member of
>> org.apache.spark.sql.hive”  I checked the documentation, and it looks like
>> it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz
>>
>
> HiveContext has been deprecated and moved to a 1.x compatibility package,
> which you'll need to include explicitly.  Docs have not been updated yet.
>
>
>> 2. I also tried the new spark session, ‘spark.table(“db.table”)’, it
>> fails with a HDFS permission denied can’t write to “/user/hive/warehouse”
>>
>
> Where are the HDFS configurations located?  We might not be propagating
> them correctly any more.
>


Spark driver and yarn behavior

2016-05-19 Thread Shankar Venkataraman
Hi!

We are running into an interesting behavior with the Spark driver. We run
Spark under Yarn. The Spark driver seems to keep sending work to a dead
executor for 3 hours before it recognizes it. The workload seems to have
been processed by other executors just fine, and we see no loss in overall
throughput. This Jira - https://issues.apache.org/jira/browse/SPARK-10586 -
seems to indicate similar behavior.

The yarn resource manager log indicates the following:

2016-05-02 21:36:40,081 INFO  util.AbstractLivelinessMonitor
(AbstractLivelinessMonitor.java:run(127)) -
Expired:dn-a01.example.org:45454 Timed out after 600 secs
2016-05-02 21:36:40,082 INFO  rmnode.RMNodeImpl
(RMNodeImpl.java:transition(746)) - Deactivating Node
dn-a01.example.org:45454 as it is now LOST

The Executor is not reachable for 10 minutes according to this log message,
but the Executor's log shows plenty of RDD processing during that time frame.
This seems like a pretty big issue because the orphan executor seems to
cause a memory leak in the Driver, and the Driver becomes unresponsive due
to heavy Full GC.

Has anyone else run into a similar situation?

Thanks for any and all feedback / suggestions.

Shankar


Re: Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Michael Armbrust
>
> 1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)”
> doesn’t work because “HiveContext not a member of
> org.apache.spark.sql.hive”  I checked the documentation, and it looks like
> it should still work for spark-2.0.0-preview-bin-hadoop2.7.tgz
>

HiveContext has been deprecated and moved to a 1.x compatibility package,
which you'll need to include explicitly.  Docs have not been updated yet.


> 2. I also tried the new spark session, ‘spark.table(“db.table”)’, it fails
> with a HDFS permission denied can’t write to “/user/hive/warehouse”
>

Where are the HDFS configurations located?  We might not be propagating
them correctly any more.


Re: SBT doesn't pick resource file after clean

2016-05-19 Thread Jakob Odersky
To echo my comment on the PR: I think the "sbt way" to add extra,
generated resources to the classpath is by adding a new task to the
`resourceGenerators` setting. Also, the task should output any files
into the directory specified by the `resourceManaged` setting. See
http://www.scala-sbt.org/0.13/docs/Howto-Generating-Files.html. There
shouldn't be any issues with clean if you follow the above
conventions.
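
A rough sketch of what such a generator task could look like in an sbt 0.13
build (the file name and contents are illustrative, not the actual PR):

  // Sketch only: write the generated properties file under resourceManaged so
  // that sbt regenerates and packages it even right after a clean.
  resourceGenerators in Compile += Def.task {
    val out = (resourceManaged in Compile).value / "spark-version-info.properties"
    val contents = s"version=${version.value}\nuser=${sys.props("user.name")}"
    IO.write(out, contents)
    Seq(out)
  }.taskValue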

On Tue, May 17, 2016 at 12:00 PM, Marcelo Vanzin  wrote:
> Perhaps you need to make the "compile" task of the appropriate module
> depend on the task that generates the resource file?
>
> Sorry but my knowledge of sbt doesn't really go too far.
>
> On Tue, May 17, 2016 at 11:58 AM, dhruve ashar  wrote:
>> We are trying to pick the Spark version automatically from the pom instead of
>> manually modifying the files. This also includes richer pieces of
>> information like the last commit, version, user who built the code, etc., to
>> better identify the framework that is running.
>>
>> The setup is as follows:
>> - A shell script generates this piece of information and dumps it into a
>> properties file under core/target/extra-resources - we don't want to pollute
>> the source directory, hence we generate this under target since it's dealing
>> with build information.
>> - The shell script is invoked in both mvn and sbt.
>>
>> The issue is that sbt doesn't pick up the generated properties file after
>> doing a clean. But it does pick it up in subsequent runs. Note, the
>> properties file is created before the classes are generated.
>>
>> The code for this is available in the PR :
>> https://github.com/apache/spark/pull/13061
>>
>> Does anybody have an idea about how we can achieve this in sbt?
>>
>> Thanks,
>> Dhruve
>>
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Possible Hive problem with Spark 2.0.0 preview.

2016-05-19 Thread Doug Balog
I haven’t had time to really look into this problem, but I want to mention it. 
I downloaded 
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-preview-bin/spark-2.0.0-preview-bin-hadoop2.7.tgz
and tried to run it against our Secure Hadoop cluster and access a Hive table.

1. “val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)”  doesn’t 
work because “HiveContext not a member of org.apache.spark.sql.hive”  I checked 
the documentation, and it looks like it should still work for 
spark-2.0.0-preview-bin-hadoop2.7.tgz

2. I also tried the new spark session, ‘spark.table(“db.table”)’, it fails with 
a HDFS permission denied can’t write to “/user/hive/warehouse”

Is there a new config option that I missed?

I tried a SNAPSHOT version, downloaded from Patrick's Apache dir from Apr
26th, and that worked the way I expected.
I'm going to go through the commits and see which one broke this, but my
builds are not running (no such method ConcurrentHashMap.keySet()) so I have to
fix that problem first.

Thanks for any hints. 

Doug



-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread vishnu prasad
+1

On 20 May 2016 at 01:19, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> +1
>
>
> 2016-05-19 18:20 GMT+02:00 Xiangrui Meng :
>
>> +1
>>
>> On Thu, May 19, 2016 at 9:18 AM Joseph Bradley 
>> wrote:
>>
>>> +1
>>>
>>> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin 
>>> wrote:
>>>
 Hi Ovidiu-Cristian ,

 The best source of truth is to change the filter to target version
 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
 get closer to the 2.0 release, more will be retargeted at 2.1.0.



 On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
 ovidiu-cristian.ma...@inria.fr> wrote:

> Yes, I can filter..
> Did that and for example:
>
> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
> 
>
> To rephrase: for 2.0, do you have specific issues that are not a
> priority and will be released maybe with 2.1, for example?
>
> Keep up the good work!
>
> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>
> You can find that by changing the filter to target version = 2.0.0.
> Cheers.
>
> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> +1 Great, I see the list of resolved issues; do you have a list of
>> known issues you plan to ship with this release?
>>
>> with
>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
>> -Phive-thriftserver -DskipTests clean package
>>
>> mvn -version
>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>> 2015-11-10T17:41:47+01:00)
>> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
>> Java version: 1.7.0_80, vendor: Oracle Corporation
>> Java home:
>> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
>> Default locale: en_US, platform encoding: UTF-8
>> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>>
>> [INFO] Reactor Summary:
>> [INFO]
>> [INFO] Spark Project Parent POM ... SUCCESS
>> [  2.635 s]
>> [INFO] Spark Project Tags . SUCCESS
>> [  1.896 s]
>> [INFO] Spark Project Sketch ... SUCCESS
>> [  2.560 s]
>> [INFO] Spark Project Networking ... SUCCESS
>> [  6.533 s]
>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS
>> [  4.176 s]
>> [INFO] Spark Project Unsafe ... SUCCESS
>> [  4.809 s]
>> [INFO] Spark Project Launcher . SUCCESS
>> [  6.242 s]
>> [INFO] Spark Project Core . SUCCESS
>> [01:20 min]
>> [INFO] Spark Project GraphX ... SUCCESS
>> [  9.148 s]
>> [INFO] Spark Project Streaming  SUCCESS [
>> 22.760 s]
>> [INFO] Spark Project Catalyst . SUCCESS [
>> 50.783 s]
>> [INFO] Spark Project SQL .. SUCCESS
>> [01:05 min]
>> [INFO] Spark Project ML Local Library . SUCCESS
>> [  4.281 s]
>> [INFO] Spark Project ML Library ... SUCCESS [
>> 54.537 s]
>> [INFO] Spark Project Tools  SUCCESS
>> [  0.747 s]
>> [INFO] Spark Project Hive . SUCCESS [
>> 33.032 s]
>> [INFO] Spark Project HiveContext Compatibility  SUCCESS
>> [  3.198 s]
>> [INFO] Spark Project REPL . SUCCESS
>> [  3.573 s]
>> [INFO] Spark Project YARN Shuffle Service . SUCCESS
>> [  4.617 s]
>> [INFO] Spark Project YARN . SUCCESS
>> [  7.321 s]
>> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
>> 16.496 s]
>> [INFO] Spark Project Assembly . SUCCESS
>> [  2.300 s]
>> [INFO] Spark Project External Flume Sink .. SUCCESS
>> [  4.219 s]
>> [INFO] Spark Project External Flume ... SUCCESS
>> [  6.987 s]
>> [INFO] Spark Project External Flume Assembly .. SUCCESS
>> [  1.465 s]
>> [INFO] Spark Integration for Kafka 0.8  SUCCESS
>> [  6.891 s]
>> [INFO] Spark Project Examples . SUCCESS [
>> 13.465 s]
>> [INFO] Spark Project External Kafka Assembly .. SUCCESS
>> 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
Will do. Thanks!

2016-05-19 13:26 GMT-07:00 Reynold Xin :

> Xiao thanks for posting. Please file a bug in JIRA. Again as I said in the
> email this is not meant to be a functional release and will contain bugs.
>
> On Thu, May 19, 2016 at 1:20 PM, Xiao Li  wrote:
>
>> -1
>>
>> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext
>> and SparkSession. Both failed. It always uses in-memory catalog. Anybody
>> else hit the same issue?
>>
>>
>> Method 1: SparkSession
>>
>> >>> from pyspark.sql import SparkSession
>>
>> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>>
>> >>>
>>
>> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>>
>> DataFrame[]
>>
>> >>> spark.sql("LOAD DATA LOCAL INPATH
>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>
>> Traceback (most recent call last):
>>
>>   File "", line 1, in 
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
>> line 494, in sql
>>
>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>> line 933, in __call__
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
>> line 57, in deco
>>
>> return f(*a, **kw)
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>> line 312, in get_return_value
>>
>> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>>
>> : java.lang.UnsupportedOperationException: loadTable is not implemented
>>
>> at
>> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>>
>> at
>> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>>
>> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>>
>> at
>> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>>
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>>
>> at
>> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>>
>> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>>
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>>
>> at
>> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>>
>> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
>>
>> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
>>
>> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>>
>> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>>
>> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>>
>> at py4j.Gateway.invoke(Gateway.java:280)
>>
>> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>>
>> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>>
>> at py4j.GatewayConnection.run(GatewayConnection.java:211)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>> Method 2: Using HiveContext:
>>
>> >>> from pyspark.sql import HiveContext
>>
>> >>> sqlContext = HiveContext(sc)
>>
>> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
>> STRING)")
>>
>> DataFrame[]
>>
>> >>> sqlContext.sql("LOAD DATA LOCAL INPATH
>> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>>
>> Traceback (most recent call last):
>>
>>   File "", line 1, in 
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py",
>> line 346, in sql
>>
>> return self.sparkSession.sql(sqlQuery)
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
>> line 494, in sql
>>
>> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>>
>>   File
>> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>> line 933, in __call__
>>
>>   File
>> 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Reynold Xin
Xiao thanks for posting. Please file a bug in JIRA. Again as I said in the
email this is not meant to be a functional release and will contain bugs.

On Thu, May 19, 2016 at 1:20 PM, Xiao Li  wrote:

> -1
>
> Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and
> SparkSession. Both failed. It always uses in-memory catalog. Anybody else
> hit the same issue?
>
>
> Method 1: SparkSession
>
> >>> from pyspark.sql import SparkSession
>
> >>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()
>
> >>>
>
> >>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>
> DataFrame[]
>
> >>> spark.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
> line 494, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
> line 57, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.
>
> : java.lang.UnsupportedOperationException: loadTable is not implemented
>
> at
> org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)
>
> at
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)
>
> at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
>
> at
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
>
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
>
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>
> at
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
>
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
>
> at
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
>
> at org.apache.spark.sql.Dataset.(Dataset.scala:187)
>
> at org.apache.spark.sql.Dataset.(Dataset.scala:168)
>
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)
>
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>
> at py4j.Gateway.invoke(Gateway.java:280)
>
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
>
> at py4j.GatewayConnection.run(GatewayConnection.java:211)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
> Method 2: Using HiveContext:
>
> >>> from pyspark.sql import HiveContext
>
> >>> sqlContext = HiveContext(sc)
>
> >>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value
> STRING)")
>
> DataFrame[]
>
> >>> sqlContext.sql("LOAD DATA LOCAL INPATH
> 'examples/src/main/resources/kv1.txt' INTO TABLE src")
>
> Traceback (most recent call last):
>
>   File "", line 1, in 
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py",
> line 346, in sql
>
> return self.sparkSession.sql(sqlQuery)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
> line 494, in sql
>
> return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
> line 933, in __call__
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
> line 57, in deco
>
> return f(*a, **kw)
>
>   File
> "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
> line 312, in get_return_value
>
> py4j.protocol.Py4JJavaError: An 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Xiao Li
-1

Unable to use Hive meta-store in pyspark shell. Tried both HiveContext and
SparkSession. Both failed. It always uses in-memory catalog. Anybody else
hit the same issue?


Method 1: SparkSession

>>> from pyspark.sql import SparkSession

>>> spark = SparkSession.builder.enableHiveSupport().getOrCreate()

>>>

>>> spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

DataFrame[]

>>> spark.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt'
INTO TABLE src")

Traceback (most recent call last):

  File "", line 1, in 

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
line 494, in sql

return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in __call__

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
line 57, in deco

return f(*a, **kw)

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.

: java.lang.UnsupportedOperationException: loadTable is not implemented

at
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)

at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)

at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)

at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)

at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)

at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)

at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)

at
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)

at
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)

at org.apache.spark.sql.Dataset.(Dataset.scala:187)

at org.apache.spark.sql.Dataset.(Dataset.scala:168)

at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:63)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:541)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)

at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)

at py4j.Gateway.invoke(Gateway.java:280)

at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)

at py4j.commands.CallCommand.execute(CallCommand.java:79)

at py4j.GatewayConnection.run(GatewayConnection.java:211)

at java.lang.Thread.run(Thread.java:745)


Method 2: Using HiveContext:

>>> from pyspark.sql import HiveContext

>>> sqlContext = HiveContext(sc)

>>> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

DataFrame[]

>>> sqlContext.sql("LOAD DATA LOCAL INPATH
'examples/src/main/resources/kv1.txt' INTO TABLE src")

Traceback (most recent call last):

  File "", line 1, in 

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/context.py",
line 346, in sql

return self.sparkSession.sql(sqlQuery)

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/session.py",
line 494, in sql

return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
line 933, in __call__

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py",
line 57, in deco

return f(*a, **kw)

  File
"/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
line 312, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling o21.sql.

: java.lang.UnsupportedOperationException: loadTable is not implemented

at
org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.loadTable(InMemoryCatalog.scala:297)

at
org.apache.spark.sql.catalyst.catalog.SessionCatalog.loadTable(SessionCatalog.scala:280)

at org.apache.spark.sql.execution.command.LoadData.run(tables.scala:263)

at
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)

at

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Herman van Hövell tot Westerflier
+1


2016-05-19 18:20 GMT+02:00 Xiangrui Meng :

> +1
>
> On Thu, May 19, 2016 at 9:18 AM Joseph Bradley 
> wrote:
>
>> +1
>>
>> On Wed, May 18, 2016 at 10:49 AM, Reynold Xin 
>> wrote:
>>
>>> Hi Ovidiu-Cristian ,
>>>
>>> The best source of truth is to change the filter to target version
>>> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
>>> get closer to the 2.0 release, more will be retargeted at 2.1.0.
>>>
>>>
>>>
>>> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
>>> ovidiu-cristian.ma...@inria.fr> wrote:
>>>
 Yes, I can filter..
 Did that and for example:

 https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
 

 To rephrase: for 2.0, do you have specific issues that are not a
 priority and will be released maybe with 2.1, for example?

 Keep up the good work!

 On 18 May 2016, at 18:19, Reynold Xin  wrote:

 You can find that by changing the filter to target version = 2.0.0.
 Cheers.

 On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
 ovidiu-cristian.ma...@inria.fr> wrote:

> +1 Great, I see the list of resolved issues; do you have a list of
> known issues you plan to ship with this release?
>
> with
> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
> -Phive-thriftserver -DskipTests clean package
>
> mvn -version
> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
> 2015-11-10T17:41:47+01:00)
> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
> Java version: 1.7.0_80, vendor: Oracle Corporation
> Java home:
> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
> Default locale: en_US, platform encoding: UTF-8
> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>
> [INFO] Reactor Summary:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [
> 2.635 s]
> [INFO] Spark Project Tags . SUCCESS [
> 1.896 s]
> [INFO] Spark Project Sketch ... SUCCESS [
> 2.560 s]
> [INFO] Spark Project Networking ... SUCCESS [
> 6.533 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
> 4.176 s]
> [INFO] Spark Project Unsafe ... SUCCESS [
> 4.809 s]
> [INFO] Spark Project Launcher . SUCCESS [
> 6.242 s]
> [INFO] Spark Project Core . SUCCESS
> [01:20 min]
> [INFO] Spark Project GraphX ... SUCCESS [
> 9.148 s]
> [INFO] Spark Project Streaming  SUCCESS [
> 22.760 s]
> [INFO] Spark Project Catalyst . SUCCESS [
> 50.783 s]
> [INFO] Spark Project SQL .. SUCCESS
> [01:05 min]
> [INFO] Spark Project ML Local Library . SUCCESS [
> 4.281 s]
> [INFO] Spark Project ML Library ... SUCCESS [
> 54.537 s]
> [INFO] Spark Project Tools  SUCCESS [
> 0.747 s]
> [INFO] Spark Project Hive . SUCCESS [
> 33.032 s]
> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
> 3.198 s]
> [INFO] Spark Project REPL . SUCCESS [
> 3.573 s]
> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
> 4.617 s]
> [INFO] Spark Project YARN . SUCCESS [
> 7.321 s]
> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
> 16.496 s]
> [INFO] Spark Project Assembly . SUCCESS [
> 2.300 s]
> [INFO] Spark Project External Flume Sink .. SUCCESS [
> 4.219 s]
> [INFO] Spark Project External Flume ... SUCCESS [
> 6.987 s]
> [INFO] Spark Project External Flume Assembly .. SUCCESS [
> 1.465 s]
> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
> 6.891 s]
> [INFO] Spark Project Examples . SUCCESS [
> 13.465 s]
> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
> 2.815 s]
> [INFO]
> 
> [INFO] BUILD SUCCESS
> [INFO]
> 
> [INFO] 

Re: SparkR dataframe error

2016-05-19 Thread Gayathri Murali
This is on Spark 2.0. I see the following in the unit-tests.log when I run
R/run-tests.sh. This is on a single Mac laptop, on the recently rebased
master. The R version is 3.3.0.

16/05/19 11:28:13.863 Executor task launch worker-1 ERROR Executor:
Exception in task 0.0 in stage 5186.0 (TID 10370)
1384595 org.apache.spark.SparkException: R computation failed with
1384596
1384597 Execution halted
1384598
1384599 Execution halted
1384600
1384601 Execution halted
1384602 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
1384603 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
1384604 at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
1384605 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
1384606 at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
1384607 at org.apache.spark.scheduler.Task.run(Task.scala:85)
1384608 at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
1384609 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
1384610 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
1384611 at java.lang.Thread.run(Thread.java:745)
1384612 16/05/19 11:28:13.864 Thread-1 INFO ContextHandler: Stopped
o.s.j.s.ServletContextHandler@22f76fa8{/jobs/json,null,UNAVAILABLE}
1384613 16/05/19 11:28:13.869 Thread-1 INFO ContextHandler: Stopped
o.s.j.s.ServletContextHandler@afe0d9f{/jobs,null,UNAVAILABLE}
1384614 16/05/19 11:28:13.869 Thread-1 INFO SparkUI: Stopped Spark web UI
at http://localhost:4040
1384615 16/05/19 11:28:13.871 Executor task launch worker-4 ERROR Executor:
Exception in task 1.0 in stage 5186.0 (TID 10371)
1384616 org.apache.spark.SparkException: R computation failed with
1384617
1384618 Execution halted
1384619
1384620 Execution halted
1384621
1384622 Execution halted
1384623 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:107)
1384624 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:49)
1384625 at
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
1384626 at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
1384627 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.
t org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
1384630 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
1384631 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
1384632 at java.lang.Thread.run(Thread.java:745)
1384633 16/05/19 11:28:13.874 nioEventLoopGroup-2-1 INFO DAGScheduler: Job
5183 failed: collect at null:-1, took 0.211674 s
1384634 16/05/19 11:28:13.875 nioEventLoopGroup-2-1 ERROR RBackendHandler:
collect on 26345 failed
1384635 16/05/19 11:28:13.876 Thread-1 INFO DAGScheduler: ResultStage 5186
(collect at null:-1) failed in 0.210 s
1384636 16/05/19 11:28:13.877 Thread-1 ERROR LiveListenerBus:
SparkListenerBus has already stopped! Dropping event
SparkListenerStageCompleted(org.apache.spark.scheduler.StageIn
 fo@413da307)
1384637 16/05/19 11:28:13.878 Thread-1 ERROR LiveListenerBus:
SparkListenerBus has already stopped! Dropping event
SparkListenerJobEnd(5183,1463682493877,JobFailed(org.apache.sp
 ark.SparkException: Job 5183 cancelled because SparkContext was shut down))
1384638 16/05/19 11:28:13.880 dispatcher-event-loop-1 INFO
MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
1384639 16/05/19 11:28:13.904 Thread-1 INFO MemoryStore: MemoryStore cleared
1384640 16/05/19 11:28:13.904 Thread-1 INFO BlockManager: BlockManager
stopped
1384641 16/05/19 11:28:13.904 Thread-1 INFO BlockManagerMaster:
BlockManagerMaster stopped
1384642 16/05/19 11:28:13.905 dispatcher-event-loop-0 INFO
OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:
OutputCommitCoordinator stopped!
1384643 16/05/19 11:28:13.909 Thread-1 INFO SparkContext: Successfully
stopped SparkContext
1384644 16/05/19 11:28:13.910 Thread-1 INFO ShutdownHookManager: Shutdown
hook called
1384645 16/05/19 11:28:13.911 Thread-1 INFO ShutdownHookManager: Deleting
directory
/private/var/folders/xy/qc35m0y55vq83dsqzg066_c4gn/T/spark-dfafdddc-fd25-4eb4-bb1d-565915
   1c8231


On Thu, May 19, 2016 at 8:46 AM, Xiangrui Meng  wrote:

> Is it on 1.6.x?
>
> On Wed, May 18, 2016, 6:57 PM Sun Rui  wrote:
>
>> I saw it, but I can’t see the complete error message on it.
>> I mean the part after “error in invokingJava(…)”
>>
>> On May 19, 2016, at 08:37, Gayathri Murali 
>> wrote:
>>
>> There was a screenshot attached to my original email. If you did not get
>> it, attaching here again.
>>
>> On Wed, May 18, 2016 at 5:27 PM, Sun Rui  wrote:
>>
>>> It’s wrong behaviour that head(df) outputs no rows.
>>> Could you send a screenshot displaying the whole error message?
>>>
>>> On May 19, 2016, at 08:12, Gayathri Murali 
>>> 

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Andrew Or
+1, some maintainers are hard to find

2016-05-19 9:03 GMT-07:00 Imran Rashid :

> +1 (binding) on removal of maintainers
>
> I don't have a strong opinion yet on how to have a system for finding the
> right reviewers.  I agree it would be nice to have something to help you
> find reviewers, though I'm a little skeptical of anything automatic.
>
> On Thu, May 19, 2016 at 10:34 AM, Matei Zaharia 
> wrote:
>
>> Hi folks,
>>
>> Around 1.5 years ago, Spark added a maintainer process for reviewing API
>> and architectural changes (
>> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
>> to make sure these are seen by people who spent a lot of time on that
>> component. At the time, the worry was that changes might go unnoticed as
>> the project grows, but there were also concerns that this approach makes
>> the project harder to contribute to and less welcoming. Since implementing
>> the model, I think that a good number of developers concluded it doesn't
>> make a huge difference, so because of these concerns, it may be useful to
>> remove it. I've also heard that we should try to keep some other
>> instructions for contributors to find the "right" reviewers, so it would be
>> great to see suggestions on that. For my part, I'd personally prefer
>> something "automatic", such as easily tracking who reviewed each patch and
>> having people look at the commit history of the module they want to work
>> on, instead of a list that needs to be maintained separately.
>>
>> Matei
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Dataset reduceByKey

2016-05-19 Thread Andres Perez
Hi all,

We were in the process of porting an RDD program to one which uses
Datasets. Most things were easy to transition, but one hole in
functionality we found was the ability to reduce a Dataset by key,
something akin to PairRDDFunctions.reduceByKey. Our first attempt of adding
the functionality ourselves involved creating a KeyValueGroupedDataset and
calling reduceGroups to get the reduced Dataset.

  class RichPairDataset[K, V: ClassTag](val ds: Dataset[(K, V)]) {
def reduceByKey(func: (V, V) => V)(implicit e1: Encoder[K], e2:
Encoder[V], e3: Encoder[(K, V)]): Dataset[(K, V)] =
  ds.groupByKey(_._1).reduceGroups { (tup1, tup2) => (tup1._1,
func(tup1._2, tup2._2)) }.map { case (k, (_, v)) => (k, v) }
  }

Note that the function passed into .reduceGroups takes in the full key-value
pair. It'd be nicer to pass in a function that maps just the values, i.e.
reduceGroups(func). This would require the ability to modify the values of
the KeyValueGroupedDataset (which is returned by the .groupByKey call on a
Dataset). Such a function (e.g., KeyValuedGroupedDataset.mapValues(func: V
=> U)) does not currently exist.

The more important issue, however, is the inefficiency of .reduceGroups.
The function does not support partial aggregation (reducing map-side), and
as a result requires shuffling all the data in the Dataset. A more
efficient alternative that that we explored involved creating a Dataset
from the KeyValueGroupedDataset by creating an Aggregator and passing it as
a TypedColumn to KeyValueGroupedDataset's .agg function. Unfortunately, the
Aggregator necessitated the creation of a zero to create a valid monoid.
However, the zero is dependent on the reduce function. The zero for a
function such as addition on Ints would be different from the zero for
taking the minimum over Ints, for example. The Aggregator requires that we
not break the rule of reduce(a, zero) == a. To do this we had to create an
Aggregator with a buffer type that stores the value along with a null flag
(using Scala's nice Option syntax yielded some mysterious errors that I
haven't worked through yet, unfortunately), used by the zero element to
signal that it should not participate in the reduce function.
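
For what it is worth, a rough sketch of that shape against the Spark 2.0
Aggregator API (names are illustrative; the (Boolean, V) buffer is the "null
flag" standing in for Option):

  import org.apache.spark.sql.{Encoder, Encoders, TypedColumn}
  import org.apache.spark.sql.expressions.Aggregator

  // The Boolean marks whether the buffer holds a real value yet, so the zero
  // never has to be fed into f and no monoid identity is required.
  def reduceColumn[K, V](f: (V, V) => V)(implicit eV: Encoder[V]): TypedColumn[(K, V), V] =
    new Aggregator[(K, V), (Boolean, V), V] {
      def zero: (Boolean, V) = (false, null.asInstanceOf[V])
      def reduce(b: (Boolean, V), a: (K, V)): (Boolean, V) =
        if (b._1) (true, f(b._2, a._2)) else (true, a._2)
      def merge(b1: (Boolean, V), b2: (Boolean, V)): (Boolean, V) =
        if (!b1._1) b2 else if (!b2._1) b1 else (true, f(b1._2, b2._2))
      def finish(b: (Boolean, V)): V = b._2
      def bufferEncoder: Encoder[(Boolean, V)] = Encoders.tuple(Encoders.scalaBoolean, eV)
      def outputEncoder: Encoder[V] = eV
    }.toColumn

  // usage sketch: ds.groupByKey(_._1).agg(reduceColumn(func)) yields
  // Dataset[(K, V)] and aggregates map-side before the shuffle.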

-Andy


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Alexander Pivovarov
Spark users might not know about CombineTextInputFormat. They probably
think that sc.textFile already implements the best way to read text files.

I think CombineTextInputFormat can replace regular TextInputFormat in most
of the cases.
Maybe Spark 2.0 can use CombineTextInputFormat in sc.textFile ?
On May 19, 2016 2:43 AM, "Reynold Xin"  wrote:

> Users would be able to run this already with the 3 lines of code you
> supplied right? In general there are a lot of methods already on
> SparkContext and we lean towards the more conservative side in introducing
> new API variants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov  > wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add combinedTextFile method (which
>> uses CombineTextInputFormat) to SparkContext?
>>
>> It allows one task to read data from multiple text files and control
>> number of RDD partitions by setting
>> mapreduce.input.fileinputformat.split.maxsize
>>
>>
>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>> val conf = sc.hadoopConfiguration
>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>> classOf[LongWritable], classOf[Text], conf).
>>   map(pair => pair._2.toString).setName(path)
>>   }
>>
>>
>> Alex
>>
>
>


Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Joseph Bradley
+1

On Wed, May 18, 2016 at 10:49 AM, Reynold Xin  wrote:

> Hi Ovidiu-Cristian ,
>
> The best source of truth is to change the filter to target version
> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
> get closer to the 2.0 release, more will be retargeted at 2.1.0.
>
>
>
> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Yes, I can filter..
>> Did that and for example:
>>
>> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
>> 
>>
>> To rephrase: for 2.0, do you have specific issues that are not a priority
>> and will be released maybe with 2.1, for example?
>>
>> Keep up the good work!
>>
>> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>>
>> You can find that by changing the filter to target version = 2.0.0.
>> Cheers.
>>
>> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> +1 Great, I see the list of resolved issues; do you have a list of known
>>> issues you plan to ship with this release?
>>>
>>> with
>>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
>>> -Phive-thriftserver -DskipTests clean package
>>>
>>> mvn -version
>>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>>> 2015-11-10T17:41:47+01:00)
>>> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
>>> Java version: 1.7.0_80, vendor: Oracle Corporation
>>> Java home:
>>> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
>>> Default locale: en_US, platform encoding: UTF-8
>>> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>>>
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS [
>>> 2.635 s]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 1.896 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 2.560 s]
>>> [INFO] Spark Project Networking ... SUCCESS [
>>> 6.533 s]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>> 4.176 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>> 4.809 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 6.242 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [01:20 min]
>>> [INFO] Spark Project GraphX ... SUCCESS [
>>> 9.148 s]
>>> [INFO] Spark Project Streaming  SUCCESS [
>>> 22.760 s]
>>> [INFO] Spark Project Catalyst . SUCCESS [
>>> 50.783 s]
>>> [INFO] Spark Project SQL .. SUCCESS
>>> [01:05 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 4.281 s]
>>> [INFO] Spark Project ML Library ... SUCCESS [
>>> 54.537 s]
>>> [INFO] Spark Project Tools  SUCCESS [
>>> 0.747 s]
>>> [INFO] Spark Project Hive . SUCCESS [
>>> 33.032 s]
>>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>>> 3.198 s]
>>> [INFO] Spark Project REPL . SUCCESS [
>>> 3.573 s]
>>> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
>>> 4.617 s]
>>> [INFO] Spark Project YARN . SUCCESS [
>>> 7.321 s]
>>> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
>>> 16.496 s]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>> 2.300 s]
>>> [INFO] Spark Project External Flume Sink .. SUCCESS [
>>> 4.219 s]
>>> [INFO] Spark Project External Flume ... SUCCESS [
>>> 6.987 s]
>>> [INFO] Spark Project External Flume Assembly .. SUCCESS [
>>> 1.465 s]
>>> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
>>> 6.891 s]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 13.465 s]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 2.815 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 07:04 min
>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00
>>> [INFO] Final Memory: 90M/824M
>>> [INFO]
>>> 
>>>
>>> On 18 May 2016, at 16:28, Sean Owen  wrote:
>>>
>>> I think it's a good idea. Although releases have been preceded before
>>> by release candidates for developers, it would be 

Re: [vote] Apache Spark 2.0.0-preview release (rc1)

2016-05-19 Thread Yin Huai
+1

On Wed, May 18, 2016 at 10:49 AM, Reynold Xin  wrote:

> Hi Ovidiu-Cristian ,
>
> The best source of truth is to change the filter to target version
> 2.1.0. Not a lot of tickets have been targeted yet, but I'd imagine as we
> get closer to the 2.0 release, more will be retargeted at 2.1.0.
>
>
>
> On Wed, May 18, 2016 at 10:43 AM, Ovidiu-Cristian MARCU <
> ovidiu-cristian.ma...@inria.fr> wrote:
>
>> Yes, I can filter..
>> Did that and for example:
>>
>> https://issues.apache.org/jira/browse/SPARK-15370?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20%3D%202.0.0
>> 
>>
>> To rephrase: for 2.0, do you have specific issues that are not a priority
>> and will be released maybe with 2.1, for example?
>>
>> Keep up the good work!
>>
>> On 18 May 2016, at 18:19, Reynold Xin  wrote:
>>
>> You can find that by changing the filter to target version = 2.0.0.
>> Cheers.
>>
>> On Wed, May 18, 2016 at 9:00 AM, Ovidiu-Cristian MARCU <
>> ovidiu-cristian.ma...@inria.fr> wrote:
>>
>>> +1 Great, I see the list of resolved issues; do you have a list of known
>>> issues you plan to ship with this release?
>>>
>>> with
>>> build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Phive
>>> -Phive-thriftserver -DskipTests clean package
>>>
>>> mvn -version
>>> Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5;
>>> 2015-11-10T17:41:47+01:00)
>>> Maven home: /Users/omarcu/tools/apache-maven-3.3.9
>>> Java version: 1.7.0_80, vendor: Oracle Corporation
>>> Java home:
>>> /Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home/jre
>>> Default locale: en_US, platform encoding: UTF-8
>>> OS name: "mac os x", version: "10.11.5", arch: "x86_64", family: “mac"
>>>
>>> [INFO] Reactor Summary:
>>> [INFO]
>>> [INFO] Spark Project Parent POM ... SUCCESS [
>>> 2.635 s]
>>> [INFO] Spark Project Tags . SUCCESS [
>>> 1.896 s]
>>> [INFO] Spark Project Sketch ... SUCCESS [
>>> 2.560 s]
>>> [INFO] Spark Project Networking ... SUCCESS [
>>> 6.533 s]
>>> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [
>>> 4.176 s]
>>> [INFO] Spark Project Unsafe ... SUCCESS [
>>> 4.809 s]
>>> [INFO] Spark Project Launcher . SUCCESS [
>>> 6.242 s]
>>> [INFO] Spark Project Core . SUCCESS
>>> [01:20 min]
>>> [INFO] Spark Project GraphX ... SUCCESS [
>>> 9.148 s]
>>> [INFO] Spark Project Streaming  SUCCESS [
>>> 22.760 s]
>>> [INFO] Spark Project Catalyst . SUCCESS [
>>> 50.783 s]
>>> [INFO] Spark Project SQL .. SUCCESS
>>> [01:05 min]
>>> [INFO] Spark Project ML Local Library . SUCCESS [
>>> 4.281 s]
>>> [INFO] Spark Project ML Library ... SUCCESS [
>>> 54.537 s]
>>> [INFO] Spark Project Tools  SUCCESS [
>>> 0.747 s]
>>> [INFO] Spark Project Hive . SUCCESS [
>>> 33.032 s]
>>> [INFO] Spark Project HiveContext Compatibility  SUCCESS [
>>> 3.198 s]
>>> [INFO] Spark Project REPL . SUCCESS [
>>> 3.573 s]
>>> [INFO] Spark Project YARN Shuffle Service . SUCCESS [
>>> 4.617 s]
>>> [INFO] Spark Project YARN . SUCCESS [
>>> 7.321 s]
>>> [INFO] Spark Project Hive Thrift Server ... SUCCESS [
>>> 16.496 s]
>>> [INFO] Spark Project Assembly . SUCCESS [
>>> 2.300 s]
>>> [INFO] Spark Project External Flume Sink .. SUCCESS [
>>> 4.219 s]
>>> [INFO] Spark Project External Flume ... SUCCESS [
>>> 6.987 s]
>>> [INFO] Spark Project External Flume Assembly .. SUCCESS [
>>> 1.465 s]
>>> [INFO] Spark Integration for Kafka 0.8  SUCCESS [
>>> 6.891 s]
>>> [INFO] Spark Project Examples . SUCCESS [
>>> 13.465 s]
>>> [INFO] Spark Project External Kafka Assembly .. SUCCESS [
>>> 2.815 s]
>>> [INFO]
>>> 
>>> [INFO] BUILD SUCCESS
>>> [INFO]
>>> 
>>> [INFO] Total time: 07:04 min
>>> [INFO] Finished at: 2016-05-18T17:55:33+02:00
>>> [INFO] Final Memory: 90M/824M
>>> [INFO]
>>> 
>>>
>>> On 18 May 2016, at 16:28, Sean Owen  wrote:
>>>
>>> I think it's a good idea. Although releases have been preceded before
>>> by release candidates for developers, it would be 

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Imran Rashid
+1 (binding) on removal of maintainers

I don't have a strong opinion yet on how to have a system for finding the
right reviewers.  I agree it would be nice to have something to help you
find reviewers, though I'm a little skeptical of anything automatic.

On Thu, May 19, 2016 at 10:34 AM, Matei Zaharia 
wrote:

> Hi folks,
>
> Around 1.5 years ago, Spark added a maintainer process for reviewing API
> and architectural changes (
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
> to make sure these are seen by people who spent a lot of time on that
> component. At the time, the worry was that changes might go unnoticed as
> the project grows, but there were also concerns that this approach makes
> the project harder to contribute to and less welcoming. Since implementing
> the model, I think that a good number of developers concluded it doesn't
> make a huge difference, so because of these concerns, it may be useful to
> remove it. I've also heard that we should try to keep some other
> instructions for contributors to find the "right" reviewers, so it would be
> great to see suggestions on that. For my part, I'd personally prefer
> something "automatic", such as easily tracking who reviewed each patch and
> having people look at the commit history of the module they want to work
> on, instead of a list that needs to be maintained separately.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Nicholas Chammas
I’ve also heard that we should try to keep some other instructions for
contributors to find the “right” reviewers, so it would be great to see
suggestions on that. For my part, I’d personally prefer something
“automatic”, such as easily tracking who reviewed each patch and having
people look at the commit history of the module they want to work on,
instead of a list that needs to be maintained separately.

Some code review and management tools like Phabricator have a system for
this , where you can configure
alerts to automatically ping certain people if a file matching some rule
(e.g. has this extension, is in this folder, etc.) is modified by a PR.

I think short of deploying Phabricator somehow, probably the most realistic
option for us to get automatic alerts like this is to have someone add that
as a feature to the Spark PR Dashboard.

I created an issue for this some time ago if anyone wants to take a crack
at it: https://github.com/databricks/spark-pr-dashboard/issues/47

Nick
​

On Thu, May 19, 2016 at 11:42 AM Tom Graves 
wrote:

> +1 (binding)
>
> Tom
>
>
> On Thursday, May 19, 2016 10:35 AM, Matei Zaharia 
> wrote:
>
>
> Hi folks,
>
> Around 1.5 years ago, Spark added a maintainer process for reviewing API
> and architectural changes (
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
> to make sure these are seen by people who spent a lot of time on that
> component. At the time, the worry was that changes might go unnoticed as
> the project grows, but there were also concerns that this approach makes
> the project harder to contribute to and less welcoming. Since implementing
> the model, I think that a good number of developers concluded it doesn't
> make a huge difference, so because of these concerns, it may be useful to
> remove it. I've also heard that we should try to keep some other
> instructions for contributors to find the "right" reviewers, so it would be
> great to see suggestions on that. For my part, I'd personally prefer
> something "automatic", such as easily tracking who reviewed each patch and
> having people look at the commit history of the module they want to work
> on, instead of a list that needs to be maintained separately.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>


right outer joins on Datasets

2016-05-19 Thread Andres Perez
Hi all, I'm getting some odd behavior when using the joinWith functionality
for Datasets. Here is a small test case:

  val left = List(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDS()
  val right = List(("a", "x"), ("b", "y"), ("d", "z")).toDS()

  val joined = left.toDF("k", "v").as[(String, Int)].alias("left")
.joinWith(right.toDF("k", "u").as[(String, String)].alias("right"),
functions.col("left.k") === functions.col("right.k"), "right_outer")
.as[((String, Int), (String, String))]
.map { case ((k, v), (_, u)) => (k, (v, u)) }.as[(String, (Int,
String))]

I would expect the result of this right-join to be:

  (a,(1,x))
  (a,(2,x))
  (b,(3,y))
  (d,(null,z))

but instead I'm getting:

  (a,(1,x))
  (a,(2,x))
  (b,(3,y))
  (null,(-1,z))

Note that the key for the final tuple is null instead of "d". (Also, is
there a reason the value for the left-side of the last tuple is -1 and not
null?)

-Andy


Re: SparkR dataframe error

2016-05-19 Thread Xiangrui Meng
Is it on 1.6.x?

On Wed, May 18, 2016, 6:57 PM Sun Rui  wrote:

> I saw it, but I can’t see the complete error message on it.
> I mean the part after “error in invokingJava(…)”
>
> On May 19, 2016, at 08:37, Gayathri Murali 
> wrote:
>
> There was a screenshot attached to my original email. If you did not get
> it, attaching here again.
>
> On Wed, May 18, 2016 at 5:27 PM, Sun Rui  wrote:
>
>> It’s wrong behaviour that head(df) outputs no rows.
>> Could you send a screenshot displaying the whole error message?
>>
>> On May 19, 2016, at 08:12, Gayathri Murali 
>> wrote:
>>
>> I am trying to run a basic example on Interactive R shell and run into
>> the following error. Also note that head(df) does not display any rows. Can
>> someone please help if I am missing something?
>>
>> 
>>
>>  Thanks
>> Gayathri
>>
>> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
>>
>>
>>
> [Attachment: Screen Shot 2016-05-18 at 5.09.29 PM.png (155K)]
> 
>
>
>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
Not exactly the same as the one you suggested, but you can chain it with
flatMap to get what you want, if each file is not huge.
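
For reference, a minimal sketch of that wholeTextFiles-plus-flatMap approach
(the helper name is illustrative, and it assumes each file comfortably fits
in memory as a single string):

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // One (path, contents) pair per file, then split contents back into lines.
  def linesViaWholeTextFiles(sc: SparkContext, path: String): RDD[String] =
    sc.wholeTextFiles(path)
      .flatMap { case (_, contents) => contents.split("\n") }
      .setName(path)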

On Thu, May 19, 2016, 8:41 AM Xiangrui Meng  wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin  wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com> wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>>
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>> val conf = sc.hadoopConfiguration
>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>> classOf[LongWritable], classOf[Text], conf).
>>>   map(pair => pair._2.toString).setName(path)
>>>   }
>>>
>>>
>>> Alex
>>>
>>
>>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
It is different, isn't it? wholeTextFiles returns one element per file,
whereas the combined input format is closer to coalescing partitions to
bin-pack them up to a certain split size.

On Thursday, May 19, 2016, Xiangrui Meng  wrote:

> This was implemented as sc.wholeTextFiles.
>
> On Thu, May 19, 2016, 2:43 AM Reynold Xin  > wrote:
>
>> Users would be able to run this already with the 3 lines of code you
>> supplied right? In general there are a lot of methods already on
>> SparkContext and we lean towards the more conservative side in introducing
>> new API variants.
>>
>> Note that this is something we are doing automatically in Spark SQL for
>> file sources (Dataset/DataFrame).
>>
>>
>> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov <
>> apivova...@gmail.com
>> > wrote:
>>
>>> Hello Everyone
>>>
>>> Do you think it would be useful to add combinedTextFile method (which
>>> uses CombineTextInputFormat) to SparkContext?
>>>
>>> It allows one task to read data from multiple text files and control
>>> number of RDD partitions by setting
>>> mapreduce.input.fileinputformat.split.maxsize
>>>
>>>
>>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>>> val conf = sc.hadoopConfiguration
>>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>>> classOf[LongWritable], classOf[Text], conf).
>>>   map(pair => pair._2.toString).setName(path)
>>>   }
>>>
>>>
>>> Alex
>>>
>>
>>


Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Xiangrui Meng
This was implemented as sc.wholeTextFiles.

On Thu, May 19, 2016, 2:43 AM Reynold Xin  wrote:

> Users would be able to run this already with the 3 lines of code you
> supplied right? In general there are a lot of methods already on
> SparkContext and we lean towards the more conservative side in introducing
> new API variants.
>
> Note that this is something we are doing automatically in Spark SQL for
> file sources (Dataset/DataFrame).
>
>
> On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov  > wrote:
>
>> Hello Everyone
>>
>> Do you think it would be useful to add combinedTextFile method (which
>> uses CombineTextInputFormat) to SparkContext?
>>
>> It allows one task to read data from multiple text files and control
>> number of RDD partitions by setting
>> mapreduce.input.fileinputformat.split.maxsize
>>
>>
>>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
>> val conf = sc.hadoopConfiguration
>> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
>> classOf[LongWritable], classOf[Text], conf).
>>   map(pair => pair._2.toString).setName(path)
>>   }
>>
>>
>> Alex
>>
>
>


Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Tom Graves
+1 (binding)
Tom 

On Thursday, May 19, 2016 10:35 AM, Matei Zaharia  
wrote:
 

 Hi folks,

Around 1.5 years ago, Spark added a maintainer process for reviewing API and 
architectural changes 
(https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
 to make sure these are seen by people who spent a lot of time on that 
component. At the time, the worry was that changes might go unnoticed as the 
project grows, but there were also concerns that this approach makes the 
project harder to contribute to and less welcoming. Since implementing the 
model, I think that a good number of developers concluded it doesn't make a 
huge difference, so because of these concerns, it may be useful to remove it. 
I've also heard that we should try to keep some other instructions for 
contributors to find the "right" reviewers, so it would be great to see 
suggestions on that. For my part, I'd personally prefer something "automatic", 
such as easily tracking who reviewed each patch and having people look at the 
commit history of the module they want to work on, instead of a list that needs 
to be maintained separately.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org


  

Re: [DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Mridul Muralidharan
+1 (binding) on removing maintainer process.
I agree with your opinion of "automatic" instead of a manual list.


Regards
Mridul

On Thursday, May 19, 2016, Matei Zaharia  wrote:

> Hi folks,
>
> Around 1.5 years ago, Spark added a maintainer process for reviewing API
> and architectural changes (
> https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
> to make sure these are seen by people who spent a lot of time on that
> component. At the time, the worry was that changes might go unnoticed as
> the project grows, but there were also concerns that this approach makes
> the project harder to contribute to and less welcoming. Since implementing
> the model, I think that a good number of developers concluded it doesn't
> make a huge difference, so because of these concerns, it may be useful to
> remove it. I've also heard that we should try to keep some other
> instructions for contributors to find the "right" reviewers, so it would be
> great to see suggestions on that. For my part, I'd personally prefer
> something "automatic", such as easily tracking who reviewed each patch and
> having people look at the commit history of the module they want to work
> on, instead of a list that needs to be maintained separately.
>
> Matei
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org 
> For additional commands, e-mail: dev-h...@spark.apache.org 
>
>


[DISCUSS] Removing or changing maintainer process

2016-05-19 Thread Matei Zaharia
Hi folks,

Around 1.5 years ago, Spark added a maintainer process for reviewing API and 
architectural changes 
(https://cwiki.apache.org/confluence/display/SPARK/Committers#Committers-ReviewProcessandMaintainers)
 to make sure these are seen by people who spent a lot of time on that 
component. At the time, the worry was that changes might go unnoticed as the 
project grows, but there were also concerns that this approach makes the 
project harder to contribute to and less welcoming. Since implementing the 
model, I think that a good number of developers concluded it doesn't make a 
huge difference, so because of these concerns, it may be useful to remove it. 
I've also heard that we should try to keep some other instructions for 
contributors to find the "right" reviewers, so it would be great to see 
suggestions on that. For my part, I'd personally prefer something "automatic", 
such as easily tracking who reviewed each patch and having people look at the 
commit history of the module they want to work on, instead of a list that needs 
to be maintained separately.

Matei
-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Spark Security. Generating SSL keystore for each job

2016-05-19 Thread ScoRp
Hello,

I have a question about Spark Security.

The Spark documentation states that when Spark is running on YARN, the
keystore for SSL encryption should be generated for each job and may be
distributed using the spark.yarn.dist.files option. However, there is no
implementation of this in the Spark source code, meaning that end users need
to make changes themselves in order to generate a keystore for each job
automatically.
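
As a rough sketch of what that manual per-job setup looks like today (the
keystore path, app name, and password handling below are illustrative, and
the keystore itself is assumed to have been generated beforehand, e.g. with
keytool):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("ssl-per-job-example")
    // Ship the job-specific keystore to the YARN containers.
    .set("spark.yarn.dist.files", "/tmp/job-1234/keystore.jks")
    // Point Spark's SSL settings at the distributed keystore.
    .set("spark.ssl.enabled", "true")
    .set("spark.ssl.keyStore", "keystore.jks")
    .set("spark.ssl.keyStorePassword", sys.env("KEYSTORE_PASSWORD"))

  val sc = new SparkContext(conf)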

What does the community think about this? Should this feature be implemented
in the Spark source code, or are there objections to having it available out
of the box? Or is Spark going to drop support for SSL encryption and use some
other encryption protocol in the near future?

Best Regards,
Nikita



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Security-Generating-SSL-keystore-for-each-job-tp17533.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: combitedTextFile and CombineTextInputFormat

2016-05-19 Thread Reynold Xin
Users would be able to run this already with the 3 lines of code you
supplied right? In general there are a lot of methods already on
SparkContext and we lean towards the more conservative side in introducing
new API variants.

Note that this is something we are doing automatically in Spark SQL for
file sources (Dataset/DataFrame).


On Sat, May 14, 2016 at 8:13 PM, Alexander Pivovarov 
wrote:

> Hello Everyone
>
> Do you think it would be useful to add combinedTextFile method (which uses
> CombineTextInputFormat) to SparkContext?
>
> It allows one task to read data from multiple text files and control
> number of RDD partitions by setting
> mapreduce.input.fileinputformat.split.maxsize
>
>
>   def combinedTextFile(sc: SparkContext)(path: String): RDD[String] = {
> val conf = sc.hadoopConfiguration
> sc.newAPIHadoopFile(path, classOf[CombineTextInputFormat],
> classOf[LongWritable], classOf[Text], conf).
>   map(pair => pair._2.toString).setName(path)
>   }
>
>
> Alex
>


Re: Nested/Chained case statements generate codegen over 64k exception

2016-05-19 Thread Jonathan Gray
That makes sense; I will take a look there first. That will at least give a
clearer understanding of the problem space and of when to fall back.
On 15 May 2016 3:02 am, "Reynold Xin"  wrote:

> It might be best to fix this with fallback first, and then figure out how
> we can do it more intelligently.
>
>
>
> On Sat, May 14, 2016 at 2:29 AM, Jonathan Gray 
> wrote:
>
>> Hi,
>>
>> I've raised JIRA SPARK-15258 (with code attached to re-produce problem)
>> and would like to have a go at fixing it but don't really know where to
>> start.  Could anyone provide some pointers?
>>
>> I've looked at the code associated with SPARK-13242 but was hoping to
>> find a way to avoid the codegen fallback.  Is this something that is
>> possible?
>>
>> Thanks,
>> Jon
>>
>
>