Re: SIGBUS (0xa) when using DataFrameWriter.insertInto

2018-10-27 Thread Ted Yu
I can't seem to find the log.
Can you double-check?
Thanks
-------- Original message --------
From: alexzautke
Date: 10/27/18 8:54 AM (GMT-08:00)
To: user@spark.apache.org
Subject: Re: SIGBUS (0xa) when using DataFrameWriter.insertInto
Please also find attached a complete error log.



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: error while submitting job

2018-09-29 Thread Ted Yu
Can you tell us the version of Spark and the connector you used?
Thanks
-------- Original message --------
From: yuvraj singh <19yuvrajsing...@gmail.com>
Date: 9/29/18 10:42 PM (GMT-08:00)
To: user@spark.apache.org
Subject: error while submitting job


Hi, I am getting this error. Please help me.

18/09/30 05:14:44 INFO Client:
     client token: N/A
     diagnostics: User class threw exception: java.lang.NoClassDefFoundError: org/apache/commons/configuration/ConfigurationException
	at org.apache.spark.sql.cassandra.DefaultSource$.(DefaultSource.scala:135)
	at org.apache.spark.sql.cassandra.DefaultSource$.(DefaultSource.scala)
	at org.apache.spark.sql.cassandra.DefaultSource.createRelation(DefaultSource.scala:55)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:340)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
	at com.ola.ss.lhf.LoginHourFetcher.main(LoginHourFetcher.java:39)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$4.run(ApplicationMaster.scala:721)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.ConfigurationException
	at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 13 more
     ApplicationMaster host: 10.14.58.163
     ApplicationMaster RPC port: 0
     queue: signals
     start time: 1538284434296
     final status: FAILED
     tracking URL: http://as-data3.prod-ambari16.olacabs.net:8088/proxy/application_1537943933870_0088/
     user: yubrajsingh
Exception in thread "main" org.apache.spark.SparkException: Application application_1537943933870_0088 finished with failed status
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1269)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1627)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:904)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Thanks Yubraj Singh 


Re: OOM: Structured Streaming aggregation state not cleaned up properly

2018-05-19 Thread Ted Yu
Hi,
w.r.t. ElementTrackingStore, since it is backed by KVStore, there should be
other classes which occupy significant memory.

Can you pastebin the top 10 entries from the heap dump?

Thanks


Re: KafkaUtils.createStream(..) is removed for API

2018-02-18 Thread Ted Yu
createStream() is still in
external/kafka-0-8/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala
but it is not in
external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaUtils.scala

FYI
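
For reference, a minimal sketch of the kafka-0-10 replacement API (createDirectStream); the broker address, topic name, and group id below are placeholders, not values from this thread:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("kafka-0-10-direct"), Seconds(10))

    // Placeholder broker, group id and topic; adjust to your cluster.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "broker1:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "example-group",
      "auto.offset.reset"  -> "latest")

    // createDirectStream is the kafka-0-10 counterpart of the removed createStream().
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(_.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}

This uses the spark-streaming-kafka-0-10_2.11 artifact already listed in the pom below.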

On Sun, Feb 18, 2018 at 5:17 PM, naresh Goud 
wrote:

> Hello Team,
>
> I see "KafkaUtils.createStream() " method not available in spark 2.2.1.
>
> Can someone please confirm if these methods are removed?
>
> Below are my pom.xml entries.
>
>
> <properties>
>   <scala.version>2.11.8</scala.version>
>   <scala.tools.version>2.11</scala.tools.version>
> </properties>
>
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_${scala.tools.version}</artifactId>
>     <version>2.2.1</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
>     <version>2.2.1</version>
>     <scope>provided</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>2.2.1</version>
>     <scope>provided</scope>
>   </dependency>
> </dependencies>
>
>
>
>
>
> Thank you,
> Naresh
>


Re: Broken SQL Visualization?

2018-01-15 Thread Ted Yu
Did you include any picture?
It looks like the picture didn't go through.
Please use a third-party image hosting site.
Thanks
-------- Original message --------
From: Tomasz Gawęda
Date: 1/15/18 2:07 PM (GMT-08:00)
To: d...@spark.apache.org, user@spark.apache.org
Subject: Broken SQL Visualization?

Hi,
today I updated my test cluster to the current Spark master; after that my SQL
Visualization page started to crash with the following error in JS:

Screenshot was cut for readability and to hide internal server names ;)

It may be caused by the upgrade or by some code changes, but - to be honest - I did
not use any new operators nor any new Spark function, so it should render
correctly, like a few days ago. Some visualizations work fine, some crash, and I
have no idea why it does not work. Can anyone help me? It is probably a bug in
Spark, but it's hard for me to say where.



Thanks in advance!
Pozdrawiam / Best regards,
Tomek


Re: how to mention others in JIRA comment please?

2017-06-26 Thread Ted Yu
You can find the JIRA handle of the person you want to mention by going to
a JIRA where that person has commented.

e.g. you want to find the handle for Joseph.
You can go to:
https://issues.apache.org/jira/browse/SPARK-6635

and click on his name in comment:
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=josephkb

The following constitutes a mention for him:
[~josephkb]

FYI

On Mon, Jun 26, 2017 at 6:56 PM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> How do I mention others in a JIRA comment?
> I added @ before other members' names, but it didn't work.
>
> Would you please help me?
>
> thanks
> Fei Shao
>


Re: the compile of spark stoped without any hints, would you like help me please?

2017-06-25 Thread Ted Yu
Does adding -X to the mvn command give you more information?

Cheers

On Sun, Jun 25, 2017 at 5:29 AM, 萝卜丝炒饭 <1427357...@qq.com> wrote:

> Hi all,
>
> Today I used a new PC to compile Spark.
> At the beginning, it worked well.
> But it stopped at some point.
> The content in the console is:
> 
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-shade-plugin:2.4.3:shade (default) @ spark-parent_2.11 ---
> [INFO] Including org.spark-project.spark:unused:jar:1.0.0 in the shaded
> jar.
> [INFO] Replacing original artifact with shaded artifact.
> [INFO]
> [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @
> spark-parent_2.11 ---
> [INFO]
> [INFO] 
> 
> [INFO] Building Spark Project Tags 2.1.2-SNAPSHOT
> [INFO] 
> 
> [INFO]
> [INFO] --- maven-enforcer-plugin:1.4.1:enforce (enforce-versions) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:add-source (eclipse-add-source) @
> spark-tags_2.11 ---
> [INFO] Add Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\scala
> [INFO] Add Test Source directory: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\scala
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath (default-cli) @
> spark-tags_2.11 ---
> [INFO] Dependencies classpath:
> C:\Users\shaof\.m2\repository\org\spark-project\spark\
> unused\1.0.0\unused-1.0.0.jar;C:\Users\shaof\.m2\repository\
> org\scala-lang\scala-library\2.11.8\scala-library-2.11.8.jar
> [INFO]
> [INFO] --- maven-remote-resources-plugin:1.5:process (default) @
> spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @
> spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\main\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @
> spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 2 Scala sources and 6 Java sources to
> E:\spark\fromweb\spark-branch-2.1\common\tags\target\scala-2.11\classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:compile (default-compile) @
> spark-tags_2.11 ---
> [INFO] Changes detected - recompiling the module!
> [INFO] Compiling 6 source files to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\classes
> [INFO]
> [INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ spark-tags_2.11
> ---
> [INFO] Executing tasks
>
> main:
> [INFO] Executed tasks
> [INFO]
> [INFO] --- maven-resources-plugin:2.6:testResources
> (default-testResources) @ spark-tags_2.11 ---
> [INFO] Using 'UTF-8' encoding to copy filtered resources.
> [INFO] skip non existing resourceDirectory E:\spark\fromweb\spark-branch-
> 2.1\common\tags\src\test\resources
> [INFO] Copying 3 resources
> [INFO]
> [INFO] --- scala-maven-plugin:3.2.2:testCompile
> (scala-test-compile-first) @ spark-tags_2.11 ---
> [WARNING] Zinc server is not available at port 3030 - reverting to normal
> incremental compile
> [INFO] Using incremental compilation
> [INFO] Compiling 3 Java sources to E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\scala-2.11\test-classes...
> [WARNING] warning: [options] bootstrap class path not set in conjunction with -source 1.7
> [WARNING] 1 warning
> [INFO]
> [INFO] --- maven-compiler-plugin:3.5.1:testCompile (default-testCompile)
> @ spark-tags_2.11 ---
> [INFO] Nothing to compile - all classes are up to date
> [INFO]
> [INFO] --- maven-dependency-plugin:2.10:build-classpath
> (generate-test-classpath) @ spark-tags_2.11 ---
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (default-test) @
> spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-surefire-plugin:2.19.1:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- scalatest-maven-plugin:1.0:test (test) @ spark-tags_2.11 ---
> [INFO] Tests are skipped.
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:test-jar (prepare-test-jar) @
> spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT-tests.jar
> [INFO]
> [INFO] --- maven-jar-plugin:2.6:jar (default-jar) @ spark-tags_2.11 ---
> [INFO] Building jar: E:\spark\fromweb\spark-branch-
> 2.1\common\tags\target\spark-tags_2.11-2.1.2-SNAPSHOT.jar
> [INFO]
> [INFO] --- 

Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Does the storage handler provide bulk load capability ?

Cheers

> On Jan 25, 2017, at 3:39 AM, Amrit Jangid  wrote:
> 
> Hi Chetan,
> 
> If you just need HBase data in Hive, you can use a Hive EXTERNAL TABLE with 
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'.
> 
> Try this and see if your problem can be solved:
> 
> https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration
> 
> Regards
> Amrit
> 
> .
> 
>> On Wed, Jan 25, 2017 at 5:02 PM, Chetan Khatri  
>> wrote:
>> Hello Spark Community Folks,
>> 
>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk Load 
>> from Hbase to Hive.
>> 
>> I have seen couple of good example at HBase Github Repo: 
>> https://github.com/apache/hbase/tree/master/hbase-spark
>> 
>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done ? 
>> Or which version of HBase has more stability with HBaseContext ?
>> 
>> Thanks.
> 
> 
>  


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
The references are vendor-specific.

I suggest contacting the vendor's mailing list about your PR.

My initial interpretation of "the HBase repository" was the Apache one.

Cheers

On Wed, Jan 25, 2017 at 7:38 AM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> @Ted Yu, Correct, but the HBase-Spark module available in the HBase repository
> seems too old and the code is not optimized yet; I have already
> submitted a PR for the same. I don't know whether it is clearly mentioned that it
> is now part of HBase itself, since people are still committing to the older repo
> where the original code is still old. [1]
>
> Other sources have updated info. [2]
>
> Ref.
> [1] http://blog.cloudera.com/blog/2015/08/apache-spark-
> comes-to-apache-hbase-with-hbase-spark-module/
> [2] https://github.com/cloudera-labs/SparkOnHBase ,
> https://github.com/esamson/SparkOnHBase
>
> On Wed, Jan 25, 2017 at 8:13 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Though no hbase release has the hbase-spark module, you can find the
>> backport patch on HBASE-14160 (for Spark 1.6)
>>
>> You can build the hbase-spark module yourself.
>>
>> Cheers
>>
>> On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Spark Community Folks,
>>>
>>> Currently I am using HBase 1.2.4 and Hive 1.2.1, I am looking for Bulk
>>> Load from Hbase to Hive.
>>>
>>> I have seen couple of good example at HBase Github Repo:
>>> https://github.com/apache/hbase/tree/master/hbase-spark
>>>
>>> If I would like to use HBaseContext with HBase 1.2.4, how it can be done
>>> ? Or which version of HBase has more stability with HBaseContext ?
>>>
>>> Thanks.
>>>
>>
>>
>


Re: HBaseContext with Spark

2017-01-25 Thread Ted Yu
Though no hbase release has the hbase-spark module, you can find the
backport patch on HBASE-14160 (for Spark 1.6)

You can build the hbase-spark module yourself.

Cheers

On Wed, Jan 25, 2017 at 3:32 AM, Chetan Khatri 
wrote:

> Hello Spark Community Folks,
>
> Currently I am using HBase 1.2.4 and Hive 1.2.1, and I am looking for bulk
> load from HBase to Hive.
>
> I have seen a couple of good examples at the HBase GitHub repo:
> https://github.com/apache/hbase/tree/master/hbase-spark
>
> If I would like to use HBaseContext with HBase 1.2.4, how can it be done?
> Or which version of HBase is more stable with HBaseContext?
>
> Thanks.
>


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
Incremental load traditionally means generating hfiles and
using org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to load the
data into hbase.

For your use case, the producer needs to find rows where the flag is 0 or 1.
After such rows are obtained, it is up to you how the result of processing
is delivered to hbase.
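
As a rough illustration only (table name, column family/qualifier, and the byte encoding of the flag values are assumptions based on this thread, not a tested recipe), finding the pending rows from Spark could look like this:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.{CompareFilter, SingleColumnValueFilter}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.protobuf.ProtobufUtil
import org.apache.hadoop.hbase.util.{Base64, Bytes}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Rows whose flag column is not yet acknowledged (i.e. 0 or 1 in the scheme above).
def pendingRows(sc: SparkContext): RDD[Result] = {
  val scan = new Scan()
  scan.setFilter(new SingleColumnValueFilter(
    Bytes.toBytes("cf"), Bytes.toBytes("flag"),
    CompareFilter.CompareOp.NOT_EQUAL, Bytes.toBytes("2")))

  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "events")
  // TableInputFormat expects the Scan serialized as a Base64 string.
  conf.set(TableInputFormat.SCAN,
    Base64.encodeBytes(ProtobufUtil.toScan(scan).toByteArray))

  sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result]).values
}

After processing, updating the flag back in HBase (or generating HFiles for LoadIncrementalHFiles) is a separate step, as noted above.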

Cheers

On Wed, Dec 21, 2016 at 8:00 AM, Chetan Khatri <chetan.opensou...@gmail.com>
wrote:

> Ok, sure, will ask.
>
> But what would be a generic best-practice solution for incremental load from
> HBase?
>
> On Wed, Dec 21, 2016 at 8:42 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I haven't used Gobblin.
>> You can consider asking Gobblin mailing list of the first option.
>>
>> The second option would work.
>>
>>
>> On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri <
>> chetan.opensou...@gmail.com> wrote:
>>
>>> Hello Guys,
>>>
>>> I would like to understand different approaches for Distributed
>>> Incremental load from HBase. Is there any *tool / incubator tool* which
>>> satisfies the requirement?
>>>
>>> *Approach 1:*
>>>
>>> Write Kafka Producer and maintain manually column flag for events and
>>> ingest it with Linkedin Gobblin to HDFS / S3.
>>>
>>> *Approach 2:*
>>>
>>> Run Scheduled Spark Job - Read from HBase and do transformations and
>>> maintain flag column at HBase Level.
>>>
>>> In both approaches above, I need to maintain column-level flags, such as 0
>>> - by default, 1 - sent, 2 - sent and acknowledged. So next time the producer
>>> will take another batch of 1000 rows where the flag is 0 or 1.
>>>
>>> I am looking for best practice approach with any distributed tool.
>>>
>>> Thanks.
>>>
>>> - Chetan Khatri
>>>
>>
>>
>


Re: Approach: Incremental data load from HBASE

2016-12-21 Thread Ted Yu
I haven't used Gobblin.
You can consider asking Gobblin mailing list of the first option.

The second option would work.


On Wed, Dec 21, 2016 at 2:28 AM, Chetan Khatri 
wrote:

> Hello Guys,
>
> I would like to understand different approaches for Distributed Incremental
> load from HBase. Is there any *tool / incubator tool* which satisfies the
> requirement?
>
> *Approach 1:*
>
> Write Kafka Producer and maintain manually column flag for events and
> ingest it with Linkedin Gobblin to HDFS / S3.
>
> *Approach 2:*
>
> Run Scheduled Spark Job - Read from HBase and do transformations and
> maintain flag column at HBase Level.
>
> In both approaches above, I need to maintain column-level flags, such as 0 -
> by default, 1 - sent, 2 - sent and acknowledged. So next time the producer will
> take another batch of 1000 rows where the flag is 0 or 1.
>
> I am looking for best practice approach with any distributed tool.
>
> Thanks.
>
> - Chetan Khatri
>


Re: namespace quota not take effect

2016-08-25 Thread Ted Yu
This question should have been posted to user@

Looks like you were using wrong config.
See:
http://hbase.apache.org/book.html#quota

See 'Setting Namespace Quotas' section further down.

Cheers

On Tue, Aug 23, 2016 at 11:38 PM, W.H  wrote:

> hi guys,
>   I am testing the HBase namespace quota for maxTables and
> maxRegions. Following the guide, I added the option "hbase.quota.enabled" with
> value "true" in hbase-site.xml, and then created the namespace:
>hbase(main):003:0> describe_namespace 'ns1'
>DESCRIPTION
>   {NAME => 'ns1', maxregions => '2', maxtables => '1'}
>
>   In the namespace definition I limited maxtables to 1, but I created 5
> tables under the namespace "ns1". It seems the quota didn't take effect.
>   The HBase cluster was restarted after hbase-site.xml was modified, and my
> HBase version is 1.1.2.2.4.
>   Any ideas? Thanks.
>
>
>
>  Best wishes.
>  who.cat


Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Please include user@ in your reply.

Can you reveal the snippet of Hive SQL?

On Wed, Aug 17, 2016 at 9:04 AM, vr spark <vrspark...@gmail.com> wrote:

> spark 1.6.1
> mesos
> The job runs for about 10-15 minutes, keeps giving this message, and I killed
> it.
>
> In this job, I am creating a data frame from a Hive SQL query. There are other
> similar jobs which work fine.
>
> On Wed, Aug 17, 2016 at 8:52 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you provide more information ?
>>
>> Were you running on YARN ?
>> Which version of Spark are you using ?
>>
>> Was your job failing ?
>>
>> Thanks
>>
>> On Wed, Aug 17, 2016 at 8:46 AM, vr spark <vrspark...@gmail.com> wrote:
>>
>>>
>>> W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492
>>>
>>> W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493
>>>
>>> W0816 23:17:01.985124 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910494
>>>
>>> W0816 23:17:01.985339 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910495
>>>
>>> W0816 23:17:01.985508 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910496
>>>
>>> W0816 23:17:01.985651 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910497
>>>
>>> W0816 23:17:01.985801 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910498
>>>
>>> W0816 23:17:01.985961 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910499
>>>
>>> W0816 23:17:01.986121 16360 sched.cpp:1195] Attempting to accept an
>>> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910500
>>>
>>> 2016-08-16 23:18:41,877:16226(0x7f71271b6700):ZOO_WARN@zookeeper_intere
>>> st@1557: Exceeded deadline by 13ms
>>>
>>> 2016-08-16 23:21:12,007:16226(0x7f71271b6700):ZOO_WARN@zookeeper_intere
>>> st@1557: Exceeded deadline by 11ms
>>>
>>>
>>>
>>>
>>
>


Re: Attempting to accept an unknown offer

2016-08-17 Thread Ted Yu
Can you provide more information ?

Were you running on YARN ?
Which version of Spark are you using ?

Was your job failing ?

Thanks

On Wed, Aug 17, 2016 at 8:46 AM, vr spark  wrote:

>
> W0816 23:17:01.984846 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910492
>
> W0816 23:17:01.984987 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910493
>
> W0816 23:17:01.985124 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910494
>
> W0816 23:17:01.985339 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910495
>
> W0816 23:17:01.985508 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910496
>
> W0816 23:17:01.985651 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910497
>
> W0816 23:17:01.985801 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910498
>
> W0816 23:17:01.985961 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910499
>
> W0816 23:17:01.986121 16360 sched.cpp:1195] Attempting to accept an
> unknown offer b859f2f3-7484-482d-8c0d-35bd91c1ad0a-O162910500
>
> 2016-08-16 23:18:41,877:16226(0x7f71271b6700):ZOO_WARN@zookeeper_
> interest@1557: Exceeded deadline by 13ms
>
> 2016-08-16 23:21:12,007:16226(0x7f71271b6700):ZOO_WARN@zookeeper_
> interest@1557: Exceeded deadline by 11ms
>
>
>
>


Re: Undefined function json_array_to_map

2016-08-17 Thread Ted Yu
Can you show the complete stack trace ?

Which version of Spark are you using ?

Thanks
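
If it turns out that json_array_to_map is a custom Hive UDF referenced by the view (an assumption on my part, not something confirmed in this thread), it would need to be registered in the Hive-enabled session before the view is queried. A Scala sketch under that assumption; the jar location and UDF class name are purely illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RegisterHiveUdfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("register-hive-udf"))
    val sqlContext = new HiveContext(sc)

    // Hypothetical jar path and class; only the function name comes from the error message.
    sqlContext.sql("ADD JAR hdfs:///libs/custom-hive-udfs.jar")
    sqlContext.sql("CREATE TEMPORARY FUNCTION json_array_to_map AS 'com.example.udf.JsonArrayToMap'")

    sqlContext.sql("SELECT parti_date FROM log_data WHERE parti_date >= 408910 LIMIT 10").show()
  }
}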

On Wed, Aug 17, 2016 at 8:46 AM, vr spark  wrote:

> Hi,
> I am getting an error in the scenario below. Please suggest.
>
> i have  a virtual view in hive
>
> view name log_data
> it has 2 columns
>
> query_map   map
>
> parti_date int
>
>
> Here is my snippet for the spark data frame
>
> my dataframe
>
> res=sqlcont.sql("select parti_date FROM log_data WHERE parti_date  >=
> 408910 limit 10")
>
> df=res.collect()
>
> print 'after collect'
>
> print df
>
>
> * File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
> line 51, in deco*
>
> *pyspark.sql.utils.AnalysisException: u'undefined function
> json_array_to_map; line 28 pos 73'*
>
>
>
>
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
Can you take a look at commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf ?

There was a test:
SPARK-15285 Generated SpecificSafeProjection.apply method grows beyond 64KB

See if it matches your use case.

On Tue, Aug 16, 2016 at 8:41 AM, Aris <arisofala...@gmail.com> wrote:

> I am still working on making a minimal test that I can share without my
> work-specific code being in there. However, the problem occurs with a
> dataframe with several hundred columns being asked to do a random split.
> Random split works with up to about 350 columns so far. It breaks in my
> code with 600 columns, but it's a converted dataset of case classes to a
> dataframe. This is deterministically causing the error in Scala 2.11.
>
> Once I can get a deterministically breaking test without work code I will
> try to file a JIRA bug.
>
> On Tue, Aug 16, 2016, 04:17 Ted Yu <yuzhih...@gmail.com> wrote:
>
>> I think we should reopen it.
>>
>> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com>
>> wrote:
>>
>> I just realized it since it broke a build with Scala 2.10.
>> https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203a
>> e1a2e9a1cf
>>
>> I can reproduce the problem in SPARK-15285 with master branch.
>> Should we reopen SPARK-15285?
>>
>> Best Regards,
>> Kazuaki Ishizaki,
>>
>>
>>
>> From:Ted Yu <yuzhih...@gmail.com>
>> To:dhruve ashar <dhruveas...@gmail.com>
>> Cc:Aris <arisofala...@gmail.com>, "user@spark.apache.org" <
>> user@spark.apache.org>
>> Date:2016/08/15 06:19
>> Subject:Re: Spark 2.0.0 JaninoRuntimeException
>> --
>>
>>
>>
>> Looks like the proposed fix was reverted:
>>
>> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
>> method grows beyond 64 KB"
>>
>> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
>>
>> Maybe this was fixed in some other JIRA ?
>>
>> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar <*dhruveas...@gmail.com*
>> <dhruveas...@gmail.com>> wrote:
>> I see a similar issue being resolved recently:
>> *https://issues.apache.org/jira/browse/SPARK-15285*
>> <https://issues.apache.org/jira/browse/SPARK-15285>
>>
>> On Fri, Aug 12, 2016 at 3:33 PM, Aris <*arisofala...@gmail.com*
>> <arisofala...@gmail.com>> wrote:
>> Hello folks,
>>
>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>> cryptic error messages:
>>
>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/
>> apache/spark/sql/catalyst/InternalRow;)I" of class "org.apache.spark.sql.
>> catalyst.expressions.GeneratedClass$SpecificOrdering" grows beyond 64 KB
>>
>> Unfortunately I'm not clear on how to even isolate the source of this
>> problem. I didn't have this problem in Spark 1.6.1.
>>
>> Any clues?
>>
>>
>>
>> --
>> -Dhruve Ashar
>>
>>
>>
>>


Re: long lineage

2016-08-16 Thread Ted Yu
Have you tried periodic checkpoints ?
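
A minimal sketch of what that can look like in an iterative job (the iteration count, transformation, and checkpoint directory below are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object PeriodicCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("periodic-checkpoint"))
    sc.setCheckpointDir("hdfs:///tmp/checkpoints") // must be a reliable filesystem

    var rdd = sc.parallelize(1 to 1000000)
    for (i <- 1 to 200) {
      rdd = rdd.map(_ + 1)        // stand-in for the real per-iteration transformation
      if (i % 10 == 0) {          // every N iterations, truncate the lineage
        rdd.checkpoint()
        rdd.count()               // force materialization so the checkpoint is written
      }
    }
    println(rdd.count())
    sc.stop()
  }
}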

Cheers

> On Aug 16, 2016, at 5:50 AM, pseudo oduesp  wrote:
> 
> Hi,
> How can we deal with a StackOverflowError triggered by a long lineage?
> I mean, I have this error - how can I resolve it without creating a new session?
> Thanks
> 




Re: class not found exception Logging while running JavaKMeansExample

2016-08-16 Thread Ted Yu
The class is:
core/src/main/scala/org/apache/spark/internal/Logging.scala

So it is in spark-core.

On Tue, Aug 16, 2016 at 2:33 AM, subash basnet <yasub...@gmail.com> wrote:

> Hello Yuzhihong,
>
> I didn't get how to implement what you said in the JavaKMeansExample.java.
> I get the Logging exception while creating the Spark session:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/internal/Logging
> at com.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.
> main(JavaKMeansExample.java*:43*)
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.internal.
> Logging
>
> The exception occurs at the *builder()*:
>
> 42SparkSession spark = SparkSession
> *43 .builder()*
> 44  .appName("JavaKMeansExample")
> 45 .getOrCreate();
>
> I have added all the necessary log4j and slf4j dependencies in the pom. I am
> still not sure what dependencies I am missing.
>
> Best Regards,
> Subash Basnet
>
> On Mon, Aug 15, 2016 at 6:50 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Logging has become private in 2.0 release:
>>
>> private[spark] trait Logging {
>>
>> On Mon, Aug 15, 2016 at 9:48 AM, subash basnet <yasub...@gmail.com>
>> wrote:
>>
>>> Hello all,
>>>
>>> I am trying to run JavaKMeansExample of the spark example project. I am
>>> getting the classnotfound exception error:
>>> *Exception in thread "main" java.lang.NoClassDefFoundError:
>>> org/apache/spark/internal/Logging*
>>> at java.lang.ClassLoader.defineClass1(Native Method)
>>> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
>>> at jcom.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.main(Ja
>>> vaKMeansExample.java:43)
>>> *Caused by: java.lang.ClassNotFoundException:
>>> org.apache.spark.internal.Logging*
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>>>
>>> I have added all the logging related dependencies as below:
>>> <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId><version>${slf4j.version}</version></dependency>
>>> <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId><version>${slf4j.version}</version><scope>${hadoop.deps.scope}</scope></dependency>
>>> <dependency><groupId>org.slf4j</groupId><artifactId>jul-to-slf4j</artifactId><version>${slf4j.version}</version></dependency>
>>> <dependency><groupId>org.slf4j</groupId><artifactId>jcl-over-slf4j</artifactId><version>${slf4j.version}</version></dependency>
>>> <dependency><groupId>log4j</groupId><artifactId>log4j</artifactId><version>${log4j.version}</version></dependency>
>>> <dependency><groupId>commons-logging</groupId><artifactId>commons-logging</artifactId><version>1.2</version></dependency>
>>>
>>> What dependencies could I be missing, any idea?
>>>
>>> Regards,
>>> Subash Basnet
>>>
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>
>>
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-16 Thread Ted Yu
I think we should reopen it. 

> On Aug 16, 2016, at 1:48 AM, Kazuaki Ishizaki <ishiz...@jp.ibm.com> wrote:
> 
> I just realized it since it broke a build with Scala 2.10.
> https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf
> 
> I can reproduce the problem in SPARK-15285 with master branch.
> Should we reopen SPARK-15285?
> 
> Best Regards,
> Kazuaki Ishizaki,
> 
> 
> 
> From:Ted Yu <yuzhih...@gmail.com>
> To:dhruve ashar <dhruveas...@gmail.com>
> Cc:Aris <arisofala...@gmail.com>, "user@spark.apache.org" 
> <user@spark.apache.org>
> Date:2016/08/15 06:19
> Subject:Re: Spark 2.0.0 JaninoRuntimeException
> 
> 
> 
> Looks like the proposed fix was reverted:
> 
> Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method 
> grows beyond 64 KB"
> 
> This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.
> 
> Maybe this was fixed in some other JIRA ?
> 
> On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar <dhruveas...@gmail.com> wrote:
> I see a similar issue being resolved recently: 
> https://issues.apache.org/jira/browse/SPARK-15285
> 
> On Fri, Aug 12, 2016 at 3:33 PM, Aris <arisofala...@gmail.com> wrote:
> Hello folks,
> 
> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that smaller 
> data unit tests work on my laptop, when I'm on a cluster, I get cryptic error 
> messages:
> 
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/spark/sql/catalyst/InternalRow;)I"
>  of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering" 
> grows beyond 64 KB
> 
> Unfortunately I'm not clear on how to even isolate the source of this 
> problem. I didn't have this problem in Spark 1.6.1.
> 
> Any clues? 
> 
> 
> 
> -- 
> -Dhruve Ashar
> 
> 
> 


Re: class not found exception Logging while running JavaKMeansExample

2016-08-15 Thread Ted Yu
Logging has become private in 2.0 release:

private[spark] trait Logging {

On Mon, Aug 15, 2016 at 9:48 AM, subash basnet  wrote:

> Hello all,
>
> I am trying to run JavaKMeansExample of the spark example project. I am
> getting the classnotfound exception error:
> *Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/internal/Logging*
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
> at jcom.dfki.spark.kmeans.KMeansSpark.JavaKMeansExample.
> main(JavaKMeansExample.java:43)
> *Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.internal.Logging*
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> I have added all the logging related dependencies as below:
> <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-api</artifactId><version>${slf4j.version}</version></dependency>
> <dependency><groupId>org.slf4j</groupId><artifactId>slf4j-log4j12</artifactId><version>${slf4j.version}</version><scope>${hadoop.deps.scope}</scope></dependency>
> <dependency><groupId>org.slf4j</groupId><artifactId>jul-to-slf4j</artifactId><version>${slf4j.version}</version></dependency>
> <dependency><groupId>org.slf4j</groupId><artifactId>jcl-over-slf4j</artifactId><version>${slf4j.version}</version></dependency>
> <dependency><groupId>log4j</groupId><artifactId>log4j</artifactId><version>${log4j.version}</version></dependency>
> <dependency><groupId>commons-logging</groupId><artifactId>commons-logging</artifactId><version>1.2</version></dependency>
>
> What dependencies could I be missing, any idea?
>
> Regards,
> Subash Basnet
>
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>


Re: Spark 2.0.0 JaninoRuntimeException

2016-08-14 Thread Ted Yu
Looks like the proposed fix was reverted:

Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply
method grows beyond 64 KB"

This reverts commit fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf.

Maybe this was fixed in some other JIRA ?

On Fri, Aug 12, 2016 at 2:30 PM, dhruve ashar  wrote:

> I see a similar issue being resolved recently: https://issues.
> apache.org/jira/browse/SPARK-15285
>
> On Fri, Aug 12, 2016 at 3:33 PM, Aris  wrote:
>
>> Hello folks,
>>
>> I'm on Spark 2.0.0 working with Datasets -- and despite the fact that
>> smaller data unit tests work on my laptop, when I'm on a cluster, I get
>> cryptic error messages:
>>
>> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method
>>> "(Lorg/apache/spark/sql/catalyst/InternalRow;Lorg/apache/
>>> spark/sql/catalyst/InternalRow;)I" of class
>>> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificOrdering"
>>> grows beyond 64 KB
>>>
>>
>> Unfortunately I'm not clear on how to even isolate the source of this
>> problem. I didn't have this problem in Spark 1.6.1.
>>
>> Any clues?
>>
>
>
>
> --
> -Dhruve Ashar
>
>


Re: Why I can't use broadcast var defined in a global object?

2016-08-13 Thread Ted Yu
Can you (or David) resend David's reply ?

I don't see the reply in this thread. 

Thanks

> On Aug 13, 2016, at 8:39 PM, yaochunnan  wrote:
> 
> Hi David, 
> Your answers have solved my problem! Detailed and accurate. Thank you very
> much!
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-can-t-use-broadcast-var-defined-in-a-global-object-tp27523p27531.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 




Re: Single point of failure with Driver host crashing

2016-08-11 Thread Ted Yu
Have you read
https://spark.apache.org/docs/latest/spark-standalone.html#high-availability
?

FYI

On Thu, Aug 11, 2016 at 12:40 PM, Mich Talebzadeh  wrote:

>
> Hi,
>
> Although Spark is fault tolerant when nodes go down like below:
>
> FROM tmp
> [Stage 1:===>   (20 + 10)
> / 100]16/08/11 20:21:34 ERROR TaskSchedulerImpl: Lost executor 3 on
> xx.xxx.197.216: worker lost
> [Stage 1:>   (44 + 8)
> / 100]
> It can carry on.
>
> However, when the node (the host) that the app was started  on goes down
> the job fails as the driver disappears  as well. Is there a way to avoid
> this single point of failure, assuming what I am stating is valid?
>
>
> Thanks
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>


Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
Mind showing the complete stack trace ?

Thanks

On Mon, Aug 8, 2016 at 12:30 PM, max square <max2subscr...@gmail.com> wrote:

> Thanks Ted for the prompt reply.
>
> There are three or four DFs that are coming from various sources and I'm
> doing a unionAll on them.
>
> val placesProcessed = placesUnchanged.unionAll(placesAddedWithMerchantId).
> unionAll(placesUpdatedFromHotelsWithMerchantId).unionAll(pla
> cesUpdatedFromRestaurantsWithMerchantId).unionAll(placesChanged)
>
> I'm using Spark 1.6.2.
>
> On Mon, Aug 8, 2016 at 3:11 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Can you show the code snippet for unionAll operation ?
>>
>> Which Spark release do you use ?
>>
>> BTW please use user@spark.apache.org in the future.
>>
>> On Mon, Aug 8, 2016 at 11:47 AM, max square <max2subscr...@gmail.com>
>> wrote:
>>
>>> Hey guys,
>>>
>>> I'm trying to save Dataframe in CSV format after performing unionAll
>>> operations on it.
>>> But I get this exception -
>>>
>>> Exception in thread "main" org.apache.spark.sql.catalyst.
>>> errors.package$TreeNodeException: execute, tree:
>>> TungstenExchange hashpartitioning(mId#430,200)
>>>
>>> I'm saving it by
>>>
>>> df.write.format("com.databricks.spark.csv").options(Map("mode" ->
>>> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "true")).save(bakDir
>>> + latest)
>>>
>>> It works perfectly if I don't do the unionAll operation.
>>> I see that the format isn't different by printing the part of the
>>> results.
>>>
>>> Any help regarding this would be appreciated.
>>>
>>>
>>
>


Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-08 Thread Ted Yu
Can you show the code snippet for unionAll operation ?

Which Spark release do you use ?

BTW please use user@spark.apache.org in the future.

On Mon, Aug 8, 2016 at 11:47 AM, max square  wrote:

> Hey guys,
>
> I'm trying to save Dataframe in CSV format after performing unionAll
> operations on it.
> But I get this exception -
>
> Exception in thread "main" 
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException:
> execute, tree:
> TungstenExchange hashpartitioning(mId#430,200)
>
> I'm saving it by
>
> df.write.format("com.databricks.spark.csv").options(Map("mode" ->
> "DROPMALFORMED", "delimiter" -> "\t", "header" -> "true")).save(bakDir +
> latest)
>
> It works perfectly if I don't do the unionAll operation.
> I see that the format isn't different by printing the part of the results.
>
> Any help regarding this would be appreciated.
>
>


Re: Multiple Sources Found for Parquet

2016-08-08 Thread Ted Yu
Can you examine the classpath to see where DefaultSource comes from?

Thanks
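
One quick way to do that (a small diagnostic sketch; the class names are taken from the error message below) is to print the code-source location of each class:

object WhereFrom {
  // Prints which jar (or directory) a class was loaded from, to spot duplicate
  // Spark SQL artifacts on the classpath.
  def whereFrom(className: String): Unit = {
    val location = Class.forName(className)
      .getProtectionDomain.getCodeSource.getLocation
    println(s"$className -> $location")
  }

  def main(args: Array[String]): Unit = {
    whereFrom("org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat")
    whereFrom("org.apache.spark.sql.execution.datasources.parquet.DefaultSource")
  }
}

If the two classes resolve to different jars, one of them is an extra/stale Spark SQL artifact that should be removed.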

On Mon, Aug 8, 2016 at 2:34 AM, 金国栋  wrote:

> I'm using Spark 2.0.0 to do SQL analysis over Parquet files. When using
> `read().parquet("path")` or `write().parquet("path")` in Java (I followed
> the example Java file in the source code exactly), I always encountered:
>
> *Exception in thread "main" java.lang.RuntimeException: Multiple sources
> found for parquet
> (org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat,
> org.apache.spark.sql.execution.datasources.parquet.DefaultSource), please
> specify the fully qualified class name.*
>
> Any idea why?
>
> Thanks.
>
> Best,
> Jelly
>


Re: submitting spark job with kerberized Hadoop issue

2016-08-07 Thread Ted Yu
The link in Jerry's response was quite old.

Please see:
http://hbase.apache.org/book.html#security

Thanks
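
For completeness, a sketch of logging in from a keytab before the HBase connection is created; the principal and keytab path are placeholders, and this is a general pattern rather than something verified against this cluster:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

object KerberosLoginSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("hadoop.security.authentication", "kerberos")
    UserGroupInformation.setConfiguration(conf)
    // Obtain a Kerberos TGT for this JVM before any HBase/HDFS calls are made.
    UserGroupInformation.loginUserFromKeytab(
      "spark-user@EXAMPLE.COM", "/etc/security/keytabs/spark-user.keytab")
    println("Logged in as: " + UserGroupInformation.getLoginUser)
  }
}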

On Sun, Aug 7, 2016 at 6:55 PM, Saisai Shao  wrote:

> 1. Standalone mode doesn't support accessing kerberized Hadoop, simply
> because it lacks the mechanism to distribute delegation tokens via cluster
> manager.
> 2. For the HBase token fetching failure, I think you have to do kinit to
> generate a TGT before starting the Spark application (http://hbase.apache.org/0.94/
> book/security.html).
>
> On Mon, Aug 8, 2016 at 12:05 AM, Aneela Saleem 
> wrote:
>
>> Thanks Wojciech and Jacek!
>>
>> I tried with Spark on Yarn with kerberized cluster it works fine now. But
>> now when i try to access Hbase through spark i get the following error:
>>
>> 2016-08-07 20:43:57,617 WARN  
>> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: 
>> Exception encountered while connecting to the server : 
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
>> GSSException: No valid credentials provided (Mechanism level: Failed to find 
>> any Kerberos tgt)]
>> 2016-08-07 20:43:57,619 ERROR 
>> [hconnection-0x24b5fa45-metaLookup-shared--pool2-t1] ipc.RpcClientImpl: SASL 
>> authentication failed. The most likely cause is missing or invalid 
>> credentials. Consider 'kinit'.
>> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
>> GSSException: No valid credentials provided (Mechanism level: Failed to find 
>> any Kerberos tgt)]
>>  at 
>> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>  at 
>> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
>>  at java.security.AccessController.doPrivileged(Native Method)
>>  at javax.security.auth.Subject.doAs(Subject.java:415)
>>  at 
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:906)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
>>  at 
>> org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1241)
>>  at 
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
>>  at 
>> org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
>>  at 
>> org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:34094)
>>  at 
>> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:201)
>>  at 
>> org.apache.hadoop.hbase.client.ClientSmallScanner$SmallScannerCallable.call(ClientSmallScanner.java:180)
>>  at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:210)
>>  at 
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:360)
>>  at 
>> org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:334)
>>  at 
>> org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:136)
>>  at 
>> org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>  at java.lang.Thread.run(Thread.java:745)
>> Caused by: GSSException: No valid credentials provided (Mechanism level: 
>> Failed to find any Kerberos tgt)
>>  at 
>> sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:147)
>>  at 
>> sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:121)
>>  at 
>> sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:187)
>>  at 
>> sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:223)
>>  at 
>> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
>>  at 
>> sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
>>  at 
>> 

Re: Symbol HasInputCol is inaccesible from this place

2016-08-06 Thread Ted Yu
I searched *Suite.scala and found that only the following contains some classes
extending Transformer:

./mllib/src/test/scala/org/apache/spark/ml/PipelineSuite.scala

But HasInputCol is not used.

FYI
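
As an alternative (a sketch only; the pass-through transform and naming are illustrative), a custom Transformer can declare its own inputCol/outputCol Params instead of mixing in the private[ml] shared traits:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{Param, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

class CustomTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("customTransformer"))

  // Declare the params directly rather than via the private HasInputCol/HasOutputCol traits.
  final val inputCol = new Param[String](this, "inputCol", "input column name")
  final val outputCol = new Param[String](this, "outputCol", "output column name")
  def setInputCol(value: String): this.type = set(inputCol, value)
  def setOutputCol(value: String): this.type = set(outputCol, value)

  // Illustrative transform: copy the input column to the output column.
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.withColumn($(outputCol), col($(inputCol)))

  override def transformSchema(schema: StructType): StructType =
    schema.add($(outputCol), schema($(inputCol)).dataType)

  override def copy(extra: ParamMap): CustomTransformer = defaultCopy(extra)
}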

On Sat, Aug 6, 2016 at 11:01 AM, janardhan shetty <janardhan...@gmail.com>
wrote:

> Yes, it seems so. I am wondering whether this can be made public in order to develop
> custom transformers, or whether there are any other alternatives?
>
> On Sat, Aug 6, 2016 at 10:07 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Is it because HasInputCol is private ?
>>
>> private[ml] trait HasInputCol extends Params {
>>
>> On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty <janardhan...@gmail.com>
>> wrote:
>>
>>> Version : 2.0.0-preview
>>>
>>> import org.apache.spark.ml.param._
>>> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>>>
>>>
>>> class CustomTransformer(override val uid: String) extends Transformer
>>> with HasInputCol with HasOutputCol with DefaultParamsWritable
>>>
>>> *Error in IntelliJ:*
>>> Symbol HasInputCol is inaccessible from this place
>>> similarly for HasOutputCol and DefaultParamsWritable
>>>
>>> Any thoughts on this error as it is not allowing the compile
>>>
>>>
>>>
>>
>


Re: Symbol HasInputCol is inaccesible from this place

2016-08-06 Thread Ted Yu
Is it because HasInputCol is private ?

private[ml] trait HasInputCol extends Params {

On Thu, Aug 4, 2016 at 1:18 PM, janardhan shetty 
wrote:

> Version : 2.0.0-preview
>
> import org.apache.spark.ml.param._
> import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
>
>
> class CustomTransformer(override val uid: String) extends Transformer with
> HasInputCol with HasOutputCol with DefaultParamsWritable
>
> *Error in IntelliJ:*
> Symbol HasInputCol is inaccessible from this place
> similarly for HasOutputCol and DefaultParamsWritable
>
> Any thoughts on this error as it is not allowing the compile
>
>
>


Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Ted Yu
One option is to clone the class in your own project.

Experts may have a better solution.
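
A minimal sketch of such a clone, kept in your own package and backed directly by SLF4J, for code that mixed in the old trait directly (the package, names, and exact method set are up to you):

package com.example.logging  // your own package, not org.apache.spark

import org.slf4j.{Logger, LoggerFactory}

// Stand-in for the removed/private org.apache.spark.Logging trait.
trait Logging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String, e: Throwable = null): Unit =
    if (e == null) log.error(msg) else log.error(msg, e)
}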

Cheers

On Fri, Aug 5, 2016 at 10:10 AM, Carlo.Allocca <carlo.allo...@open.ac.uk>
wrote:

> Hi Ted,
>
> Thanks for the prompt answer.
> It is not yet clear to me what I should do.
>
> How to fix it?
>
> Many thanks,
> Carlo
>
> On 5 Aug 2016, at 17:58, Ted Yu <yuzhih...@gmail.com> wrote:
>
> private[spark] trait Logging {
>
>
> -- The Open University is incorporated by Royal Charter (RC 000391), an
> exempt charity in England & Wales and a charity registered in Scotland (SC
> 038302). The Open University is authorised and regulated by the Financial
> Conduct Authority.
>


Re: ClassNotFoundException org.apache.spark.Logging

2016-08-05 Thread Ted Yu
In 2.0, Logging became private:

private[spark] trait Logging {

FYI

On Fri, Aug 5, 2016 at 9:53 AM, Carlo.Allocca 
wrote:

> Dear All,
>
> I would like to ask for your help about the following issue: 
> java.lang.ClassNotFoundException:
> org.apache.spark.Logging
>
> I checked and the class Logging is not present.
> Moreover, the line of code where the exception is thrown
>
> final org.apache.spark.mllib.regression.LinearRegressionModel lrModel
> = LinearRegressionWithSGD.train(a, numIterations,
> stepSize);
>
>
> My POM is as reported below.
>
>
> What am I doing wrong or missing? How can I fix it?
>
> Many thanks in advance for your support.
>
> Best,
> Carlo
>
>
>
>  POM
>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>2.0.0</version>
>     <type>jar</type>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>2.0.0</version>
>     <type>jar</type>
>   </dependency>
>   <dependency>
>     <groupId>log4j</groupId>
>     <artifactId>log4j</artifactId>
>     <version>1.2.17</version>
>     <scope>test</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.slf4j</groupId>
>     <artifactId>slf4j-log4j12</artifactId>
>     <version>1.7.16</version>
>     <scope>test</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.hadoop</groupId>
>     <artifactId>hadoop-client</artifactId>
>     <version>2.7.2</version>
>   </dependency>
>   <dependency>
>     <groupId>junit</groupId>
>     <artifactId>junit</artifactId>
>     <version>4.12</version>
>   </dependency>
>   <dependency>
>     <groupId>org.hamcrest</groupId>
>     <artifactId>hamcrest-core</artifactId>
>     <version>1.3</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-mllib_2.10</artifactId>
>     <version>1.3.0</version>
>     <type>jar</type>
>   </dependency>
> </dependencies>
>
> -- The Open University is incorporated by Royal Charter (RC 000391), an
> exempt charity in England & Wales and a charity registered in Scotland (SC
> 038302). The Open University is authorised and regulated by the Financial
> Conduct Authority.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: What is "Developer API " in spark documentation?

2016-08-05 Thread Ted Yu
See previous discussion :

http://search-hadoop.com/m/q3RTtTvrPrc6O2h1=Re+discuss+separate+API+annotation+into+two+components+InterfaceAudience+InterfaceStability

> On Aug 5, 2016, at 2:55 AM, Aseem Bansal  wrote:
> 
> Hi
> 
> Many parts of the Spark documentation say "Developer API". What does that mean?




Re: source code for org.spark-project.hive

2016-08-04 Thread Ted Yu
https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2

FYI

On Thu, Aug 4, 2016 at 6:23 AM, prabhat__  wrote:

> Hey,
> can anyone point me to the source code for the jars used with group-id
> org.spark-project.hive?
> This was previously maintained in the private repo of pwendell
> (https://github.com/pwendell/hive), which doesn't seem to be active now.
>
> Where can I find the source code for group: org.spark-project.hive, version:
> 1.2.1.spark2?
>
>
> Thanks
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/source-code-for-org-spark-project-hive-tp27476.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: how to debug spark app?

2016-08-03 Thread Ted Yu
Have you looked at:
https://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application

If you use Mesos:
https://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging
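
For breakpoint-style debugging specifically (a general JVM technique, not a Spark-specific tool, and not covered by the links above), one option is to attach a remote debugger over JDWP. A sketch, assuming port 5005 is reachable from your IDE:

import org.apache.spark.{SparkConf, SparkContext}

object DebuggableApp {
  def main(args: Array[String]): Unit = {
    // Executors can be made debuggable from code, before the context starts.
    // For the driver JVM the same JDWP string must instead be passed at submit time,
    // e.g. --conf spark.driver.extraJavaOptions=..., since the driver is already running here.
    val conf = new SparkConf()
      .setAppName("debuggable-app")
      .set("spark.executor.extraJavaOptions",
        "-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005")
    val sc = new SparkContext(conf)

    sc.parallelize(1 to 100).map(_ * 2).count()  // set IDE breakpoints in code like this map
    sc.stop()
  }
}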

On Wed, Aug 3, 2016 at 6:13 PM, glen  wrote:

> Is there any tool like gdb which supports breakpoints at a given line or function?
>
>
>
>
>
>


Re: java.net.URISyntaxException: Relative path in absolute URI:

2016-08-03 Thread Ted Yu
SPARK-15899  ?
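
If it is indeed SPARK-15899, the workaround that is usually suggested (an assumption here, since the JIRA is only referenced, not quoted) is to point spark.sql.warehouse.dir at an explicit URI when building the session. A Scala sketch; the Windows path is a placeholder:

import org.apache.spark.sql.SparkSession

object WarehouseDirWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("JavaRandomForestRegressorExample")
      // Any valid file:/// URI avoids the malformed "file:C:\..." path on Windows.
      .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
      .getOrCreate()

    spark.stop()
  }
}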

On Wed, Aug 3, 2016 at 11:05 AM, Flavio  wrote:

> Hello everyone,
>
> I am trying to run a very easy example but unfortunately I am stuck on the
> following exception:
>
> Exception in thread "main" java.lang.IllegalArgumentException:
> java.net.URISyntaxException: Relative path in absolute URI: file: "absolute
> directory"
>
> I was wondering if anyone got this exception trying to run the examples on
> the Spark git repo; actually the code I am trying to run is the following:
>
>
> //$example on$
> import org.apache.spark.ml.Pipeline;
> import org.apache.spark.ml.PipelineModel;
> import org.apache.spark.ml.PipelineStage;
> import org.apache.spark.ml.evaluation.RegressionEvaluator;
> import org.apache.spark.ml.feature.VectorIndexer;
> import org.apache.spark.ml.feature.VectorIndexerModel;
> import org.apache.spark.ml.regression.RandomForestRegressionModel;
> import org.apache.spark.ml.regression.RandomForestRegressor;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> //$example off$
>
> public class JavaRandomForestRegressorExample {
> public static void main(String[] args) {
> System.setProperty("hadoop.home.dir", "C:\\winutils");
>
> SparkSession spark = SparkSession
> .builder()
> .master("local[*]")
>
> .appName("JavaRandomForestRegressorExample")
> .getOrCreate();
>
> // $example on$
> // Load and parse the data file, converting it to a
> DataFrame.
> Dataset<Row> data =
> spark.read().format("libsvm").load("C:\\data\\sample_libsvm_data.txt");
>
> // Automatically identify categorical features, and index
> them.
> // Set maxCategories so features with > 4 distinct values
> are treated as
> // continuous.
> VectorIndexerModel featureIndexer = new
> VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures")
> .setMaxCategories(4).fit(data);
>
> // Split the data into training and test sets (30% held
> out for testing)
> Dataset<Row>[] splits = data.randomSplit(new double[] {
> 0.7, 0.3 });
> Dataset<Row> trainingData = splits[0];
> Dataset<Row> testData = splits[1];
>
> // Train a RandomForest model.
> RandomForestRegressor rf = new
>
> RandomForestRegressor().setLabelCol("label").setFeaturesCol("indexedFeatures");
>
> // Chain indexer and forest in a Pipeline
> Pipeline pipeline = new Pipeline().setStages(new
> PipelineStage[] {
> featureIndexer, rf });
>
> // Train model. This also runs the indexer.
> PipelineModel model = pipeline.fit(trainingData);
>
> // Make predictions.
> Dataset<Row> predictions = model.transform(testData);
>
> // Select example rows to display.
> predictions.select("prediction", "label",
> "features").show(5);
>
> // Select (prediction, true label) and compute test error
> RegressionEvaluator evaluator = new
> RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction")
> .setMetricName("rmse");
> double rmse = evaluator.evaluate(predictions);
> System.out.println("Root Mean Squared Error (RMSE) on test
> data = " +
> rmse);
>
> RandomForestRegressionModel rfModel =
> (RandomForestRegressionModel)
> (model.stages()[1]);
> System.out.println("Learned regression forest model:\n" +
> rfModel.toDebugString());
> // $example off$
>
> spark.stop();
> }
> }
>
>
> Thanks to everyone for reading/answering!
>
> Flavio
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/java-net-URISyntaxException-Relative-path-in-absolute-URI-tp27466.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Spark 2.0 has been released.

Mind giving it a try :-) ?

On Wed, Aug 3, 2016 at 9:11 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> OK, thank you. What do you suggest I do to get rid of the error?
>
>
> ----------
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Wednesday, August 3, 2016 6:10 PM
> *To:* Rychnovsky, Dusan
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
> The latest QA run was no longer accessible (error 404):
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull
>
> Looking at the comments on the PR, there is not enough confidence to pull
> the fix into 1.6.
>
> On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan <
> dusan.rychnov...@firma.seznam.cz> wrote:
>
>> I am confused.
>>
>>
>> I tried to look for Spark that would have this issue fixed, i.e.
>> https://github.com/apache/spark/pull/13027/ merged in, but it looks like
>> the patch has not been merged for 1.6.
>>
>>
>> How do I get a fixed 1.6 version?
>>
>>
>> Thanks,
>>
>> Dusan
>>
>>
>> <https://github.com/apache/spark/pull/13027/>
>> [SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can
>> starve others on the same thread for memory by lianhuiwang · Pull Request
>> #13027 · apache/spark · GitHub
>> What changes were proposed in this pull request? This PR is for the
>> branch-1.6 version of the commits PR #10024. In #9241 It implemented a
>> mechanism to call spill() on those SQL operators that sup...
>> Read more... <https://github.com/apache/spark/pull/13027/>
>>
>>
>>
>> --
>> *From:* Rychnovsky, Dusan
>> *Sent:* Wednesday, August 3, 2016 3:58 PM
>> *To:* Ted Yu
>>
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable
>> to acquire X bytes of memory, got 0
>>
>>
>> Yes, I believe I'm using Spark 1.6.0.
>>
>>
>> > spark-submit --version
>> Welcome to
>>     __
>>  / __/__  ___ _/ /__
>> _\ \/ _ \/ _ `/ __/  '_/
>>/___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>>   /_/
>>
>> I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and
>> therefore should have it fixed, right? Or what do I do to fix it?
>>
>>
>> Thanks,
>>
>> Dusan
>>
>>
>> --
>> *From:* Ted Yu <yuzhih...@gmail.com>
>> *Sent:* Wednesday, August 3, 2016 3:52 PM
>> *To:* Rychnovsky, Dusan
>> *Cc:* user@spark.apache.org
>> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable
>> to acquire X bytes of memory, got 0
>>
>> Are you using Spark 1.6+ ?
>>
>> See SPARK-11293
>>
>> On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
>> dusan.rychnov...@firma.seznam.cz> wrote:
>>
>>> Hi,
>>>
>>>
>>> I have a Spark workflow that when run on a relatively small portion of
>>> data works fine, but when run on big data fails with strange errors. In the
>>> log files of failed executors I found the following errors:
>>>
>>>
>>> Firstly
>>>
>>>
>>> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>>>
>>> And then a series of
>>>
>>> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got
>>> 0
>>>
>>> > at
>>> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>>>
>>>
>>> > at
>>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>>>
>>>
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>
>>> > at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>
>>> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>
>>> > at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>
>>> > at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>
>>>
>>> > at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>
>>>
>>> > at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> The job keeps failing in the same way (I tried a few times).
>>>
>>>
>>> What could be causing such error?
>>>
>>> I have a feeling that I'm not providing enough context necessary to
>>> understand the issue. Please ask for any other information needed.
>>>
>>>
>>> Thank you,
>>>
>>> Dusan
>>>
>>>
>>>
>>
>
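
For anyone who needs the branch-1.6 backport before it is merged, one option is to build a custom Spark 1.6 distribution with the pull request applied. A rough sketch, assuming git and the standard Spark build; the local branch name and build profiles below are only examples and should match your cluster:

  git clone https://github.com/apache/spark.git
  cd spark
  # pull the branch-1.6 backport from pull request #13027 into a local branch
  git fetch origin pull/13027/head:pr-13027-branch-1.6
  git checkout pr-13027-branch-1.6
  # build a distribution; pick the profiles that match your Hadoop/YARN setup
  ./make-distribution.sh --name patched-1.6 --tgz -Phadoop-2.6 -Pyarn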


Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
The latest QA run was no longer accessible (error 404):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59141/consoleFull

Looking at the comments on the PR, there is not enough confidence in
pulling in the fix into 1.6

On Wed, Aug 3, 2016 at 9:05 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> I am confused.
>
>
> I tried to look for Spark that would have this issue fixed, i.e.
> https://github.com/apache/spark/pull/13027/ merged in, but it looks like
> the patch has not been merged for 1.6.
>
>
> How do I get a fixed 1.6 version?
>
>
> Thanks,
>
> Dusan
>
>
> <https://github.com/apache/spark/pull/13027/>
> [SPARK-4452][SPARK-11293][Core][BRANCH-1.6] Shuffle data structures can
> starve others on the same thread for memory by lianhuiwang · Pull Request
> #13027 · apache/spark · GitHub
> What changes were proposed in this pull request? This PR is for the
> branch-1.6 version of the commits PR #10024. In #9241 It implemented a
> mechanism to call spill() on those SQL operators that sup...
> Read more... <https://github.com/apache/spark/pull/13027/>
>
>
>
> ----------
> *From:* Rychnovsky, Dusan
> *Sent:* Wednesday, August 3, 2016 3:58 PM
> *To:* Ted Yu
>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
>
> Yes, I believe I'm using Spark 1.6.0.
>
>
> > spark-submit --version
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
>       /_/
>
> I don't understand the ticket. It says "Fixed in 1.6.0". I have 1.6.0 and
> therefore should have it fixed, right? Or what do I do to fix it?
>
>
> Thanks,
>
> Dusan
>
>
> --
> *From:* Ted Yu <yuzhih...@gmail.com>
> *Sent:* Wednesday, August 3, 2016 3:52 PM
> *To:* Rychnovsky, Dusan
> *Cc:* user@spark.apache.org
> *Subject:* Re: Managed memory leak detected + OutOfMemoryError: Unable to
> acquire X bytes of memory, got 0
>
> Are you using Spark 1.6+ ?
>
> See SPARK-11293
>
> On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
> dusan.rychnov...@firma.seznam.cz> wrote:
>
>> Hi,
>>
>>
>> I have a Spark workflow that when run on a relatively small portion of
>> data works fine, but when run on big data fails with strange errors. In the
>> log files of failed executors I found the following errors:
>>
>>
>> Firstly
>>
>>
>> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>>
>> And then a series of
>>
>> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got
>> 0
>>
>> > at
>> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>>
>>
>> > at
>> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>>
>>
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>
>> > at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>
>> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>
>> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>
>> > at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>
>>
>> > at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>
>>
>> > at java.lang.Thread.run(Thread.java:745)
>>
>>
>> The job keeps failing in the same way (I tried a few times).
>>
>>
>> What could be causing such error?
>>
>> I have a feeling that I'm not providing enough context necessary to
>> understand the issue. Please ask for any other information needed.
>>
>>
>> Thank you,
>>
>> Dusan
>>
>>
>>
>


Re: Managed memory leak detected + OutOfMemoryError: Unable to acquire X bytes of memory, got 0

2016-08-03 Thread Ted Yu
Are you using Spark 1.6+ ?

See SPARK-11293

On Wed, Aug 3, 2016 at 5:03 AM, Rychnovsky, Dusan <
dusan.rychnov...@firma.seznam.cz> wrote:

> Hi,
>
>
> I have a Spark workflow that when run on a relatively small portion of
> data works fine, but when run on big data fails with strange errors. In the
> log files of failed executors I found the following errors:
>
>
> Firstly
>
>
> > Managed memory leak detected; size = 263403077 bytes, TID = 6524
>
> And then a series of
>
> > java.lang.OutOfMemoryError: Unable to acquire 241 bytes of memory, got 0
>
> > at
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>
>
> > at
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
>
>
> > at
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
>
>
> > at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
>
>
> > at
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
>
>
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>
> > at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>
> > at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
>
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
>
> > at java.lang.Thread.run(Thread.java:745)
>
>
> The job keeps failing in the same way (I tried a few times).
>
>
> What could be causing such error?
>
> I have a feeling that I'm not providing enough context necessary to
> understand the issue. Please ask for any other information needed.
>
>
> Thank you,
>
> Dusan
>
>
>


Re: [2.0.0] mapPartitions on DataFrame unable to find encoder

2016-08-02 Thread Ted Yu
Using spark-shell of master branch:

scala> case class Entry(id: Integer, name: String)
defined class Entry

scala> val df  = Seq((1,"one"), (2, "two")).toDF("id", "name").as[Entry]
16/08/02 16:47:01 DEBUG package$ExpressionCanonicalizer:
=== Result of Batch CleanExpressions ===
!assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)._1 AS _1#10   assertnotnull(input[0, scala.Tuple2, true], top level
non-flat input object)._1
!+- assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)._1 +- assertnotnull(input[0, scala.Tuple2, true], top level
non-flat input object)
!   +- assertnotnull(input[0, scala.Tuple2, true], top level non-flat input
object)+- input[0, scala.Tuple2, true]
!  +- input[0, scala.Tuple2, true]
...

scala> df.mapPartitions(_.take(1))

On Tue, Aug 2, 2016 at 1:55 PM, Dragisa Krsmanovic 
wrote:

> I am trying to use mapPartitions on DataFrame.
>
> Example:
>
> import spark.implicits._
> val df: DataFrame = Seq((1,"one"), (2, "two")).toDF("id", "name")
> df.mapPartitions(_.take(1))
>
> I am getting:
>
> Unable to find encoder for type stored in a Dataset.  Primitive types
> (Int, String, etc) and Product types (case classes) are supported by
> importing spark.implicits._  Support for serializing other types will be
> added in future releases.
>
> Since DataFrame is Dataset[Row], I was expecting encoder for Row to be
> there.
>
> What's wrong with my code ?
>
>
> --
>
> Dragiša Krsmanović | Platform Engineer | Ticketfly
>
> dragi...@ticketfly.com
>
> @ticketfly  | ticketfly.com/blog |
> facebook.com/ticketfly
>


Re: Extracting key word from a textual column

2016-08-02 Thread Ted Yu
+1

> On Aug 2, 2016, at 2:29 PM, Jörn Franke  wrote:
> 
> If you need single inserts, updates, deletes, and selects, why not use HBase 
> with Phoenix? I see it as complementary to the Hive / warehouse offering. 
> 
>> On 02 Aug 2016, at 22:34, Mich Talebzadeh  wrote:
>> 
>> Hi,
>> 
>> I decided to create a catalog table in Hive ORC and transactional. That 
>> table has two columns of value
>> 
>> transactiondescription === account_table.transactiondescription
>> hashtag String column created from a semi automated process of deriving it 
>> from account_table.transactiondescription
>> Once the process is complete in populating the catalog table then we just 
>> need to create a new DF based on join between catalog table and the 
>> account_table. The join will use hashtag in catalog table to loop over debit 
>> column in account_table for a given hashtag. That is pretty fast as going 
>> through pattern matching is pretty intensive in any application and database 
>> in real time.
>> 
>> So one can build up the catalog table over time as a reference table. I am 
>> sure such tables exist in commercial world.
>> 
>> Anyway after getting results out I know how I am wasting my money on 
>> different things, especially on clothing  etc :)
>> 
>> 
>> HTH
>> 
>> P.S. Also there is an issue with Spark not being able to read data through 
>> Hive transactional tables that have not been compacted yet. Spark just 
>> crashes. If these tables need to be updated regularly say catalog table and 
>> they are pretty small, one might maintain them in an RDBMS and read them 
>> once through JDBC into a DataFrame in Spark before doing analytics.
>> 
>> 
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>> 
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>>> On 2 August 2016 at 17:56, Sonal Goyal  wrote:
>>> Hi Mich,
>>> 
>>> It seems like an entity resolution problem - looking at different 
>>> representations of an entity - SAINSBURY in this case and matching them all 
>>> together. How dirty is your data in the description - are there stop words 
>>> like SACAT/SMKT etc you can strip off and get the base retailer entity ?
>>> 
>>> Best Regards,
>>> Sonal
>>> Founder, Nube Technologies 
>>> Reifier at Strata Hadoop World
>>> Reifier at Spark Summit 2015
>>> 
>>> 
>>> 
>>> 
>>> 
 On Tue, Aug 2, 2016 at 9:55 PM, Mich Talebzadeh 
  wrote:
 Thanks.
 
 I believe there is some catalog of companies that I can get and store 
 in a table and match the company name to the transactiondescription column.
 
 That catalog should have sectors in it. For example company XYZ is under 
 Grocers etc which will make search and grouping much easier.
 
 I believe Spark can do it, though I am generally interested on alternative 
 ideas.
 
 
 
 
 
 Dr Mich Talebzadeh
  
 LinkedIn  
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
  
 http://talebzadehmich.wordpress.com
 
 Disclaimer: Use it at your own risk. Any and all responsibility for any 
 loss, damage or destruction of data or any other property which may arise 
 from relying on this email's technical content is explicitly disclaimed. 
 The author will in no case be liable for any monetary damages arising from 
 such loss, damage or destruction.
  
 
> On 2 August 2016 at 16:26, Yong Zhang  wrote:
> Well, if you still want to use windows function for your logic, then you 
> need to derive a new column out, like "catalog", and use it as part of 
> grouping logic.
> 
> 
> Maybe you can use regex for deriving out this new column. The 
> implementation needs to depend on your data in "transactiondescription", 
> and regex gives you the most powerful way to handle your data.
> 
> 
> This is really not a Spark question, but how to you process your logic 
> based on the data given.
> 
> 
> Yong
> 
> 
> From: Mich Talebzadeh 
> Sent: Tuesday, August 2, 2016 10:00 AM
> To: user @spark
> Subject: Extracting key word from a textual column
>  
> Hi,
> 
> Need some ideas.
> 
> Summary:
> 
> I am working on a tool to slice and dice the amount of money I have spent 
> so far (meaning the whole data sample) on a given retailer so I have a 
> better idea of where I 
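
To make the regex suggestion above concrete, a minimal sketch of deriving a hashtag/catalog column with regexp_extract; df stands in for the account_table DataFrame, and the retailer names in the pattern are only placeholders taken from the thread:

  import org.apache.spark.sql.functions.regexp_extract

  // derive a new column from the free-text transactiondescription column
  val withHashtag = df.withColumn(
    "hashtag",
    regexp_extract(df("transactiondescription"), "(?i)(SAINSBURY|TESCO|ASDA)", 1))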

Re: Job can not terminated in Spark 2.0 on Yarn

2016-08-02 Thread Ted Yu
Which hadoop version are you using ?

Can you show snippet of your code ?

Thanks

On Tue, Aug 2, 2016 at 10:06 AM, Liangzhao Zeng 
wrote:

> Hi,
>
>
> I migrated my code to Spark 2.0 from 1.6. It finishes the last stage (and the result is 
> correct) but then gets the following errors and starts over.
>
>
> Any idea on what happen?
>
>
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(2,WrappedArray())
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(115,WrappedArray())
> 16/08/02 16:59:33 ERROR scheduler.LiveListenerBus: SparkListenerBus has 
> already stopped! Dropping event 
> SparkListenerExecutorMetricsUpdate(70,WrappedArray())
> 16/08/02 16:59:33 WARN yarn.YarnAllocator: Expected to find pending requests, 
> but found none.
> 16/08/02 16:59:33 WARN netty.Dispatcher: Message 
> RemoteProcessDisconnected(17.138.53.26:55338) dropped. Could not find 
> MapOutputTracker.
>
>
>
> Cheers,
>
>
> LZ
>
>


Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-08-01 Thread Ted Yu
Have you seen the following ?
http://stackoverflow.com/questions/27553547/xloggc-not-creating-log-file-if-path-doesnt-exist-for-the-first-time

On Sat, Jul 23, 2016 at 5:18 PM, Ascot Moss <ascot.m...@gmail.com> wrote:

> I tried to add -Xloggc:./jvm_gc.log
>
> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps -Xloggc:./jvm_gc.log -XX:+PrintGCDateStamps"
>
> however, I could not find ./jvm_gc.log
>
> How to resolve the OOM and gc log issue?
>
> Regards
>
> On Sun, Jul 24, 2016 at 6:37 AM, Ascot Moss <ascot.m...@gmail.com> wrote:
>
>> My JDK is Java 1.8 u40
>>
>> On Sun, Jul 24, 2016 at 3:45 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Since you specified +PrintGCDetails, you should be able to get some
>>> more detail from the GC log.
>>>
>>> Also, which JDK version are you using ?
>>>
>>> Please use Java 8 where G1GC is more reliable.
>>>
>>> On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss <ascot.m...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I added the following parameter:
>>>>
>>>> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
>>>> -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
>>>> -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
>>>> -XX:+PrintGCTimeStamps"
>>>>
>>>> Still got Java heap space error.
>>>>
>>>> Any idea to resolve?  (my spark is 1.6.1)
>>>>
>>>>
>>>> 16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID
>>>> 22, n1791): java.lang.OutOfMemoryError: Java heap space   at
>>>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>>>>
>>>> at
>>>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
>>>>
>>>> at
>>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)
>>>>
>>>> at
>>>> org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)
>>>>
>>>> at
>>>> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
>>>> at
>>>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>>>
>>>> at
>>>> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>>>>
>>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>>
>>>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>
>>>> at
>>>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>
>>>> at
>>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>>>
>>>> at
>>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>>>
>>>> at
>>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>>>
>>>> at 
>>>> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>>>>
>>>> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>>>>
>>>> at
>>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>>>
>>>> at
>>>> scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>>>>
>>>> at
>>>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>>>>
>>>> at
>>>> scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>>>>
>>>> at
>>>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>>>
>>>> at
>>>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>>>>
>>>> at
>>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>>>
>>>> at
>>>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>>>>
>>>> at
>>
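
Following the Stack Overflow pointer above, a common fix is to give -Xloggc an absolute path in a directory that already exists on every executor node; a sketch, using /tmp purely as an example location:

  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -Xloggc:/tmp/spark_executor_gc.log"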

Re: JettyUtils.createServletHandler Method not Found?

2016-08-01 Thread Ted Yu
Original discussion was about Spark 1.3

Which Spark release are you using ?

Cheers

On Mon, Aug 1, 2016 at 1:37 AM, bg_spark <1412743...@qq.com> wrote:

> Hello, I have the same problem as you. How did you solve it?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/JettyUtils-createServletHandler-Method-not-Found-tp22262p27446.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: ERROR Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.

2016-07-23 Thread Ted Yu
Since you specified +PrintGCDetails, you should be able to get some more
detail from the GC log.

Also, which JDK version are you using ?

Please use Java 8 where G1GC is more reliable.

On Sat, Jul 23, 2016 at 10:38 AM, Ascot Moss  wrote:

> Hi,
>
> I added the following parameter:
>
> --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
> -XX:MaxGCPauseMillis=200 -XX:ParallelGCThreads=20 -XX:ConcGCThreads=5
> -XX:InitiatingHeapOccupancyPercent=70 -XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps"
>
> Still got Java heap space error.
>
> Any idea to resolve?  (my spark is 1.6.1)
>
>
> 16/07/23 23:31:50 WARN TaskSetManager: Lost task 1.0 in stage 6.0 (TID 22,
> n1791): java.lang.OutOfMemoryError: Java heap space   at
> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>
> at
> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:136)
>
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:248)
>
> at
> org.apache.spark.util.collection.CompactBuffer.toArray(CompactBuffer.scala:30)
>
> at
> org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$findSplits$1(DecisionTree.scala:1009)
> at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>
> at
> org.apache.spark.mllib.tree.DecisionTree$$anonfun$29.apply(DecisionTree.scala:1042)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Regards
>
>
>
> On Sat, Jul 23, 2016 at 9:49 AM, Ascot Moss  wrote:
>
>> Thanks. Trying with extra conf now.
>>
>> On Sat, Jul 23, 2016 at 6:59 AM, RK Aduri 
>> wrote:
>>
>>> I can see large number of collections happening on driver and
>>> eventually, driver is running out of memory. ( am not sure whether you have
>>> persisted any rdd or data frame). May be you would want to avoid doing so
>>> many collections or persist unwanted data in memory.
>>>
>>> To begin with, you may want to re-run the job with this following
>>> config: --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC
>>> -XX:+PrintGCDetails -XX:+PrintGCTimeStamps” —> and this will give you an
>>> idea of how you are getting OOM.
>>>
>>>
>>> On Jul 22, 2016, at 3:52 PM, Ascot Moss  wrote:
>>>
>>> Hi
>>>
>>> Please help!
>>>
>>>  When running random forest training phase in cluster mode, I got GC
>>> overhead limit exceeded.
>>>
>>> I have used two parameters when submitting the job to cluster
>>>
>>> --driver-memory 64g \
>>>
>>> --executor-memory 8g \
>>>
>>> My Current settings:
>>>
>>> (spark-defaults.conf)
>>>
>>> spark.executor.memory   8g
>>>
>>> (spark-env.sh)
>>>
>>> export SPARK_WORKER_MEMORY=8g
>>>
>>> export HADOOP_HEAPSIZE=8000
>>>
>>>
>>> Any idea how to resolve it?
>>>
>>> Regards
>>>
>>>
>>>
>>>
>>>
>>>
>>> ###  (the error log) ###
>>>
>>> 16/07/23 04:34:04 WARN TaskSetManager: Lost task 2.0 in stage 6.1 (TID
>>> 30, n1794): java.lang.OutOfMemoryError: GC overhead limit exceeded
>>>
>>> at
>>> scala.reflect.ManifestFactory$$anon$12.newArray(Manifest.scala:138)
>>>
>>> at
>>> 

Re: Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError: Java heap space

2016-07-22 Thread Ted Yu
How much heap memory do you give the driver ?

On Fri, Jul 22, 2016 at 2:17 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Given I get a stack trace in my python notebook I am guessing the driver
> is running out of memory?
>
> My app is simple: it creates a list of DataFrames from s3:// and counts
> each one. I would not think this would take a lot of driver memory.
>
> I am not running my code locally. It's using 12 cores. Each node has 6G.
>
> Any suggestions would be greatly appreciated
>
> Andy
>
> def work():
>
> constituentDFS = getDataFrames(constituentDataSets)
>
> results = ["{} {}".format(name, constituentDFS[name].count()) for name
> in constituentDFS]
>
> print(results)
>
> return results
>
>
> %timeit -n 1 -r 1 results = work()
>
>
>  in (.0)
>       1 def work():
>       2     constituentDFS = getDataFrames(constituentDataSets)
> ----> 3     results = ["{} {}".format(name, constituentDFS[name].count()) for name in constituentDFS]
>       4     print(results)
>       5     return results
>
>
> 16/07/22 17:54:38 WARN TaskSetManager: Stage 146 contains a task of very
> large size (145 KB). The maximum recommended task size is 100 KB.
>
> 16/07/22 18:39:47 WARN HeartbeatReceiver: Removing executor 2 with no
> recent heartbeats: 153037 ms exceeds timeout 12 ms
>
> Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError:
> Java heap space
>
> at java.util.jar.Manifest$FastInputStream.(Manifest.java:332)
>
> at java.util.jar.Manifest$FastInputStream.(Manifest.java:327)
>
> at java.util.jar.Manifest.read(Manifest.java:195)
>
> at java.util.jar.Manifest.(Manifest.java:69)
>
> at java.util.jar.JarFile.getManifestFromReference(JarFile.java:199)
>
> at java.util.jar.JarFile.getManifest(JarFile.java:180)
>
> at sun.misc.URLClassPath$JarLoader$2.getManifest(URLClassPath.java:944)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:450)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.logExecutorLoss(TaskSchedulerImpl.scala:510)
>
> at
> org.apache.spark.scheduler.TaskSchedulerImpl.executorLost(TaskSchedulerImpl.scala:473)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:199)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$org$apache$spark$HeartbeatReceiver$$expireDeadHosts$3.apply(HeartbeatReceiver.scala:195)
>
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>
> at
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> at org.apache.spark.HeartbeatReceiver.org
> $apache$spark$HeartbeatReceiver$$expireDeadHosts(HeartbeatReceiver.scala:195)
>
> at
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:118)
>
> at
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:104)
>
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
>
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>
> 16/07/22 19:08:29 WARN NettyRpcEnv: Ignored message: true
>
>
>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
You can use this command (assuming log aggregation is turned on):

yarn logs --applicationId XX

In the log, you should see snippet such as the following:

java.class.path=...

FYI

On Thu, Jul 21, 2016 at 9:38 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> what's the easiest way to get the Classpath for the spark application
> itself?
>
> On Thu, Jul 21, 2016 at 9:37 PM Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Might be classpath issue.
>>
>> Mind pastebin'ning the effective class path ?
>>
>> Stack trace of NoClassDefFoundError may also help provide some clue.
>>
>> On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>>
>>> Hello - I'm trying to deploy the Spark TimeSeries library in a new
>>> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
>>> with Java 8 installed on all nodes but I'm getting the NoClassDef at
>>> runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
>>> part of Java 8 I feel like I shouldn't need to do anything else. The weird
>>> thing is I get it on the data nodes, not the driver. Any thoughts on what's
>>> causing this or how to track it down? Would appreciate the help.
>>>
>>> Thanks!
>>>
>>
>>


Re: NoClassDefFoundError with ZonedDateTime

2016-07-21 Thread Ted Yu
Might be classpath issue.

Mind pastebin'ning the effective class path ?

Stack trace of NoClassDefFoundError may also help provide some clue.

On Thu, Jul 21, 2016 at 8:26 PM, Ilya Ganelin  wrote:

> Hello - I'm trying to deploy the Spark TimeSeries library in a new
> environment. I'm running Spark 1.6.1 submitted through YARN in a cluster
> with Java 8 installed on all nodes but I'm getting the NoClassDef at
> runtime when trying to create a new TimeSeriesRDD. Since ZonedDateTime is
> part of Java 8 I feel like I shouldn't need to do anything else. The weird
> thing is I get it on the data nodes, not the driver. Any thoughts on what's
> causing this or how to track it down? Would appreciate the help.
>
> Thanks!
>


Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-20 Thread Ted Yu
You can decide which component(s) to use for storing your data.
If you haven't used hbase before, it may be better to store data on hdfs
and query through Hive or SparkSQL.

Maintaining hbase is not trivial task, especially when the cluster size is
large.

How much data are you expecting to be written on a daily / weekly basis ?

Cheers

On Wed, Jul 20, 2016 at 7:22 AM, Yu Wei  wrote:

> I'm beginner to big data. I don't have too much knowledge about hbase/hive.
>
> What's the difference between hbase and hive/hdfs for storing data for
> analytics?
>
>
> Thanks,
>
> Jared
> --
> *From:* ayan guha 
> *Sent:* Wednesday, July 20, 2016 9:34:24 PM
> *To:* Rabin Banerjee
> *Cc:* user; Yu Wei; Deepak Sharma
>
> *Subject:* Re: Is it good choice to use DAO to store results generated by
> spark application?
>
>
> Just as a rain check, saving data to hbase for analytics may not be the
> best choice. Any specific reason for not using hdfs or hive?
> On 20 Jul 2016 20:57, "Rabin Banerjee" 
> wrote:
>
>> Hi Wei ,
>>
>> You can do something like this ,
>>
>> rdd.foreachPartition { part =>
>>   // requires: import scala.collection.JavaConverters._
>>   val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
>>   val table = conn.getTable(TableName.valueOf(tablename))
>>   // part.foreach(inp => { println(inp); table.put(inp) })   // line-by-line put
>>   table.put(part.toList.asJava)   // batched put
>>   table.close()
>>   conn.close()
>> }
>>
>>
>>
>> Now if you want to wrap it inside a DAO, it's up to you. A DAO will add
>> abstraction, but it is ultimately going to use the same code.
>>
>> Note: Always use the HBase ConnectionFactory to get a connection, and write
>> data on a per-partition basis.
>>
>> Regards,
>> Rabin Banerjee
>>
>>
>> On Wed, Jul 20, 2016 at 12:06 PM, Yu Wei  wrote:
>>
>>> I need to write all data received from MQTT data into hbase for further
>>> processing.
>>>
>>> They're not final result.  I also need to read the data from hbase for
>>> analysis.
>>>
>>>
>>> Is it good choice to use DAO in such situation?
>>>
>>>
>>> Thx,
>>>
>>> Jared
>>>
>>>
>>> --
>>> *From:* Deepak Sharma 
>>> *Sent:* Wednesday, July 20, 2016 12:34:07 PM
>>> *To:* Yu Wei
>>> *Cc:* spark users
>>> *Subject:* Re: Is it good choice to use DAO to store results generated
>>> by spark application?
>>>
>>>
>>> I am using DAO in spark application to write the final computation to
>>> Cassandra  and it performs well.
>>> What kinds of issues you foresee using DAO for hbase ?
>>>
>>> Thanks
>>> Deepak
>>>
>>> On 19 Jul 2016 10:04 pm, "Yu Wei"  wrote:
>>>
 Hi guys,


 I write spark application and want to store results generated by spark
 application to hbase.

 Do I need to access hbase via java api directly?

 Or is it better choice to use DAO similar as traditional RDBMS?  I
 suspect that there is major performance downgrade and other negative
 impacts using DAO. However, I have little knowledge in this field.


 Any advice?


 Thanks,

 Jared




>>


Re: Is it good choice to use DAO to store results generated by spark application?

2016-07-19 Thread Ted Yu
hbase-spark module is in the up-coming hbase 2.0 release.
Currently it is in master branch of hbase git repo.

FYI

On Tue, Jul 19, 2016 at 8:27 PM, Andrew Ehrlich  wrote:

> There is a Spark<->HBase library that does this.  I used it once in a
> prototype (never tried in production through):
> http://blog.cloudera.com/blog/2015/08/apache-spark-comes-to-apache-hbase-with-hbase-spark-module/
>
> On Jul 19, 2016, at 9:34 AM, Yu Wei  wrote:
>
> Hi guys,
>
> I write spark application and want to store results generated by spark
> application to hbase.
> Do I need to access hbase via java api directly?
> Or is it better choice to use DAO similar as traditional RDBMS?  I suspect
> that there is major performance downgrade and other negative impacts using
> DAO. However, I have little knowledge in this field.
>
> Any advice?
>
> Thanks,
> Jared
>
>
>


Re: Missing Executor Logs From Yarn After Spark Failure

2016-07-19 Thread Ted Yu
What's the value for yarn.log-aggregation.retain-seconds
and yarn.log-aggregation-enable ?

Which hadoop release are you using ?

Thanks

On Tue, Jul 19, 2016 at 3:23 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I am trying to find the root cause of recent Spark application failure in
> production. When the Spark application is running I can check NodeManager's
> yarn.nodemanager.log-dir property to get the Spark executor container logs.
>
> The container has logs for both the running Spark applications
>
> Here is the view of the container logs: drwx--x--- 3 yarn yarn 51 Jul 19
> 09:04 application_1467068598418_0209 drwx--x--- 5 yarn yarn 141 Jul 19
> 09:04 application_1467068598418_0210
>
> But when the application is killed both the application logs are
> automatically deleted. I have set all the log retention setting etc in Yarn
> to a very large number. But still these logs are deleted as soon as the
> Spark applications are crashed.
>
> Question: How can we retain these Spark application logs in Yarn for
> debugging when the Spark application is crashed for some reason.
>
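
For reference, aggregation of container logs after an application finishes is controlled by the two properties asked about above; with aggregation enabled the logs move to HDFS and can be read back with "yarn logs -applicationId <appId>". A sketch of the yarn-site.xml settings (the retention value is just an example, one week in seconds):

  yarn.log-aggregation-enable = true
  yarn.log-aggregation.retain-seconds = 604800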


Re: I'm trying to understand how to compile Spark

2016-07-19 Thread Ted Yu
org.apache.spark.mllib.fpm is not a maven goal.

-pl is for building individual projects (modules).

Your first build action should not include -pl.


On Tue, Jul 19, 2016 at 4:22 AM, Eli Super  wrote:

> Hi
>
> I have a windows laptop
>
> I just downloaded the spark 1.4.1 source code.
>
> I try to compile *org.apache.spark.mllib.fpm* with *mvn *
>
> My goal is to replace *original *org\apache\spark\mllib\fpm\* in
> *spark-assembly-1.4.1-hadoop2.6.0.jar*
>
> As I understand from this link
>
>
> https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-Eclipse
>
>
> I need to execute following command : build/mvn package -DskipTests -pl
> assembly
> I executed : mvn org.apache.spark.mllib.fpm  -DskipTests -pl assembly
>
> Then I got an error
>  [INFO] Scanning for projects...
> [ERROR] [ERROR] Could not find the selected project in the reactor:
> assembly @
>
> Thanks for any help
>
>
>
>
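
A sketch of the intended invocation: a Maven phase such as "package" goes where the package name was put, and -pl only narrows the build to individual modules once a full build has populated the local repository. The module name below is an assumption based on the 1.4.1 source layout:

  # full build first, so all inter-module artifacts land in the local repository
  build/mvn -DskipTests clean install
  # afterwards, rebuild just the mllib module (which contains org.apache.spark.mllib.fpm)
  build/mvn -DskipTests -pl mllib package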


Re: Spark ResourceLeak??

2016-07-19 Thread Ted Yu
ResourceLeakDetector doesn't seem to be from Spark.

Please check dependencies for potential leak.

Cheers

On Tue, Jul 19, 2016 at 6:11 AM, Guruji  wrote:

> I am running a Spark Cluster on Mesos. The module reads data from Kafka as
> DirectStream and pushes it into elasticsearch after referring to a redis
> for
> getting Names against IDs.
>
> I have been getting this message in my worker logs.
>
> *16/07/19 11:17:44 ERROR ResourceLeakDetector: LEAK: You are creating too
> many HashedWheelTimer instances.  HashedWheelTimer is a shared resource
> that
> must be reused across the JVM,so that only a few instances are created.
> *
>
> Can't figure out the reason for the resource leak. When this happens, the
> batches start slowing down and the pending queue grows. There is hardly any
> way to recover from that, other than killing the job and starting it
> again.
>
> Any idea why the resource leak? This message seems to be related to akka
> when I googled. I am using Spark 1.6.2.
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-ResourceLeak-tp27355.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Input path does not exist error in giving input file for word count program

2016-07-15 Thread Ted Yu
From examples/src/main/scala/org/apache/spark/examples/streaming/HdfsWordCount.scala:

val lines = ssc.textFileStream(args(0))
val words = lines.flatMap(_.split(" "))

In your case, looks like inputfile didn't correspond to an existing path.

On Fri, Jul 15, 2016 at 1:05 AM, RK Spark  wrote:

> val count = inputfile.flatMap(line => line.split(" ")).map(word =>
> (word,1)).reduceByKey(_ + _);
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
>


Re: Call http request from within Spark

2016-07-14 Thread Ted Yu
I second what Pedro said in the second paragraph.

Issuing http request per row would not scale.

On Thu, Jul 14, 2016 at 12:26 PM, Pedro Rodriguez 
wrote:

> Hi Amit,
>
> Have you tried running a subset of the IDs locally on a single thread? It
> would be useful to benchmark your getProfile function for a subset of the
> data then estimate how long the full data set would take then divide by
> number of spark executor cores. This should at least serve as a sanity
> check. If things are much slower than expected is it possible that the
> service has a rate limit per ip address that you are hitting?
>
> If requests is more efficient at batching requests together (I don't know
> much about its internal implementation and connection pools) you could do
> that with mapPartitions. This is useful when the initialization time of the
> function in the map call is expensive (eg uses a connection pool for a db
> or web) as it allows you to initialize that resource once per partition
> then reuse it for all the elements in the partition.
>
> Pedro
>
> On Thu, Jul 14, 2016 at 8:52 AM, Amit Dutta 
> wrote:
>
>> Hi All,
>>
>>
>> I have a requirement to call a rest service url for 300k customer ids.
>>
>> Things I have tried so far is
>>
>>
>> custid_rdd = sc.textFile('file:Users/zzz/CustomerID_SC/Inactive User
>> Hashed LCID List.csv') #getting all the customer ids and building adds
>>
>> profile_rdd = custid_rdd.map(lambda r: getProfile(r.split(',')[0]))
>>
>> profile_rdd.count()
>>
>>
>> #getprofile is the method to do the http call
>>
>> def getProfile(cust_id):
>>
>> api_key = 'txt'
>>
>> api_secret = 'yuyuy'
>>
>> profile_uri = 'https://profile.localytics.com/x1/customers/{}'
>>
>> customer_id = cust_id
>>
>>
>> if customer_id is not None:
>>
>> data = requests.get(profile_uri.format(customer_id),
>> auth=requests.auth.HTTPBasicAuth(api_key, api_secret))
>>
>> # print json.dumps(data.json(), indent=4)
>>
>> return data
>>
>>
>> when I print the json dump of the data i see it returning results from
>> the rest call. But the count never stops.
>>
>>
>> Is there an efficient way of dealing this? Some post says we have to
>> define a batch size etc but don't know how.
>>
>>
>> Appreciate your help
>>
>>
>> Regards,
>>
>> Amit
>>
>>
>
>
> --
> Pedro Rodriguez
> PhD Student in Distributed Machine Learning | CU Boulder
> UC Berkeley AMPLab Alumni
>
> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
> Github: github.com/EntilZha | LinkedIn:
> https://www.linkedin.com/in/pedrorodriguezscience
>
>
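
To make the mapPartitions batching pattern above concrete, a small Scala sketch of the per-partition shape (the thread's code is PySpark, but the idea carries over; custIdRdd, the URL and the response handling are placeholders, and authentication is omitted):

  import java.net.{HttpURLConnection, URL}
  import scala.io.Source

  val profiles = custIdRdd.mapPartitions { ids =>
    // anything expensive (an auth token, a pooled HTTP client, ...) would be
    // created once here and reused for every id in the partition
    ids.map { id =>
      val conn = new URL(s"https://profile.example.com/x1/customers/$id")
        .openConnection().asInstanceOf[HttpURLConnection]
      try Source.fromInputStream(conn.getInputStream).mkString
      finally conn.disconnect()
    }
  }
  profiles.count()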


Re: Issue in spark job. Remote rpc client dissociated

2016-07-13 Thread Ted Yu
Which Spark release are you using ?

Can you disclose what the folder processing does (code snippet is better) ?

Thanks

On Wed, Jul 13, 2016 at 9:44 AM, Balachandar R.A. 
wrote:

> Hello
>
> In one of my use cases, i need to process list of folders in parallel. I
> used
> Sc.parallelize (list,list.size).map(" logic to process the folder").
> I have a six node cluster and there are six folders to process.  Ideally i
> expect that each of my node process one folder.  But,  i see that a node
> process multiple folders while one or two of the nodes do not get any job.
> In the end, the spark- submit crashes with the exception saying "remote RPC
> client dissociated". Can someone give me a hint on what's going wrong here?
> Please note that this issue does not arise if i comment my logic that
> processes the folder but simply print folder name. In this case,  every
> node gets one folder to process.  I inserted a sleep of 40 seconds inside
> the map. No issue. But when i uncomment my logic i see this issue. Also,
> before crashing it does process some of the folders successfully.
> Successfully means the business logic generates a file in a shared file
> system.
>
> Regards
> Bala
>


Re: Optimize filter operations with sorted data

2016-07-07 Thread Ted Yu
Does the filter under consideration operate on sorted column(s) ?

Cheers

> On Jul 7, 2016, at 2:25 AM, tan shai  wrote:
> 
> Hi, 
> 
> I have a sorted dataframe and I need to optimize the filter operations.
> How does Spark perform filter operations on a sorted dataframe? 
> 
> Is it scanning all the data? 
> 
> Many thanks. 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Saving parquet table as uncompressed with write.mode("overwrite").

2016-07-03 Thread Ted Yu
Have you tried the following (note the extraneous dot in your config name) ?

val c = sqlContext.setConf("spark.sql.parquet.compression.codec", "none")

Also, parquet() has compression parameter which defaults to None

FYI

On Sun, Jul 3, 2016 at 2:42 PM, Mich Talebzadeh 
wrote:

> Hi,
>
> I simply read a Parquet table
>
> scala> val s = sqlContext.read.parquet("oraclehadoop.sales2")
> s: org.apache.spark.sql.DataFrame = [prod_id: bigint, cust_id: bigint,
> time_id: timestamp, channel_id: bigint, promo_id: bigint, quantity_sold:
> decimal(10,0), amount_sold: decimal(10,0)]
>
> Now all I want is to save data and make it uncompressed. By default it
> saves the table as *gzipped*
>
> val s4 = s.write.mode("overwrite").parquet("/user/hduser/sales4")
>
> However, I want to use this approach without explicitly creating the table myself
> with sqlContext etc.
>
> This does not seem to work
>
> val c = sqlContext.setConf("spark.sql.parquet.compression.codec.",
> "uncompressed")
>
> Can I do through a method on DataFrame "s" above to make the table saved
> as uncompressed?
>
> Thanks,
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
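
A minimal sketch of the two knobs above, assuming the Spark 1.x SQLContext API used in this thread (other accepted codec names include "snappy", "gzip" and "lzo"):

  // note: no trailing dot in the key
  sqlContext.setConf("spark.sql.parquet.compression.codec", "uncompressed")
  s.write.mode("overwrite").parquet("/user/hduser/sales4")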


Re: Spark driver assigning splits to incorrect workers

2016-07-01 Thread Ted Yu
I guess you extended some InputFormat for providing locality information.

Can you share some code snippet ?

Which non-distributed file system are you using ?

Thanks

On Fri, Jul 1, 2016 at 2:46 PM, Raajen  wrote:

> I would like to use Spark on a non-distributed file system but am having
> trouble getting the driver to assign tasks to the workers that are local to
> the files. I have extended InputSplits to create my own version of
> FileSplits, so that each worker gets a bit more information than the
> default
> FileSplit provides. I thought that the driver would assign splits based on
> their locality. But I have found that the driver will send these splits to
> workers seemingly at random -- even the very first split will go to a node
> with a different IP than the split specifies. I can see that I am providing
> the right node address via GetLocations. I also set spark.locality.wait to
> a
> high value, but the same misassignment keeps happening.
>
> I am using newAPIHadoopFile to create my RDD. InputFormat is creating the
> required splits, but not all splits refer to the same file or the same
> worker IP.
>
> What else I can check, or change, to force the driver to send these tasks
> to
> the right workers?
>
> Thanks!
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-driver-assigning-splits-to-incorrect-workers-tp27261.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Why so many parquet file part when I store data in Alluxio or File?

2016-06-30 Thread Ted Yu
Looking under Alluxio source, it seems only "fs.hdfs.impl.disable.cache" is
in use.

FYI

On Thu, Jun 30, 2016 at 9:30 PM, Deepak Sharma 
wrote:

> Ok.
> I came across this issue.
> Not sure if you already assessed this:
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6921
>
> The workaround mentioned may work for you .
>
> Thanks
> Deepak
> On 1 Jul 2016 9:34 am, "Chanh Le"  wrote:
>
>> Hi Deepark,
>> Thank for replying. The way to write into alluxio is
>> df.write.mode(SaveMode.Append).partitionBy("network_id", "time").parquet(
>> "alluxio://master1:1/FACT_ADMIN_HOURLY”)
>>
>>
>> I partition by 2 columns and store. I just want the write to automatically
>> produce files of a proper size for the 512MB block size I already set in Alluxio.
>>
>>
>> On Jul 1, 2016, at 11:01 AM, Deepak Sharma  wrote:
>>
>> Before writing, coalesce your RDD to 1 partition.
>> It will create only 1 output file.
>> Multiple part files happen because all your executors write their
>> partitions to separate part files.
>>
>> Thanks
>> Deepak
>> On 1 Jul 2016 8:01 am, "Chanh Le"  wrote:
>>
>> Hi everyone,
>> I am using Alluxio for storage, but I am a little confused: I set the
>> Alluxio block size to 512MB, yet each file part is only a few KB and there
>> are too many parts.
>> Is that normal? I want reads to be fast; do that many parts affect the
>> read operation?
>> How can I control the size of the file parts?
>>
>> Thanks.
>> Chanh
>>
>>
>>
>>
>>
>> 
>>
>>
>>


Re: Spark master shuts down when one of zookeeper dies

2016-06-30 Thread Ted Yu
Looking at Master.scala, I don't see code that would bring master back up
automatically.
Probably you can implement a monitoring tool so that you get an alert when the
master goes down.

e.g.
http://stackoverflow.com/questions/12896998/how-to-set-up-alerts-on-ganglia

More experienced users may have better suggestion.

On Thu, Jun 30, 2016 at 2:09 AM, vimal dinakaran <vimal3...@gmail.com>
wrote:

> Hi Ted,
>  Thanks for the pointers. I had a three node zookeeper setup . Now the
> master alone dies when  a zookeeper instance is down and a new master is
> elected as leader and the cluster is up.
> But the master that was down , never comes up.
>
> Is this the expected ? Is there a way to get alert when a master is down ?
> How to make sure that there is atleast one back up master is up always ?
>
> Thanks
> Vimal
>
>
>
>
> On Tue, Jun 28, 2016 at 7:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Please see some blog w.r.t. the number of nodes in the quorum:
>>
>>
>> http://stackoverflow.com/questions/13022244/zookeeper-reliability-three-versus-five-nodes
>>
>> http://www.ibm.com/developerworks/library/bd-zookeeper/
>>   the paragraph starting with 'A quorum is represented by a strict
>> majority of nodes'
>>
>> FYI
>>
>> On Tue, Jun 28, 2016 at 5:52 AM, vimal dinakaran <vimal3...@gmail.com>
>> wrote:
>>
>>> I am using zookeeper for providing HA for spark cluster.  We have two
>>> nodes zookeeper cluster.
>>>
>>> When one of the zookeeper dies then the entire spark cluster goes down .
>>>
>>> Is this expected behaviour ?
>>> Am I missing something in config ?
>>>
>>> Spark version - 1.6.1.
>>> Zookeeper version - 3.4.6
>>> // spark-env.sh
>>> SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
>>> -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
>>>
>>> Below is the log from spark master:
>>> ZooKeeperLeaderElectionAgent: We have lost leadership
>>> 16/06/27 09:39:30 ERROR Master: Leadership has been revoked -- master
>>> shutting down.
>>>
>>> Thanks
>>> Vimal
>>>
>>>
>>>
>>>
>>
>


Metadata for the StructField

2016-06-29 Thread Ted Yu
You can specify Metadata for the StructField :

case class StructField(
name: String,
dataType: DataType,
nullable: Boolean = true,
metadata: Metadata = Metadata.empty) {

FYI

On Wed, Jun 29, 2016 at 2:50 AM, pooja mehta  wrote:

> Hi,
>
> Want to add a metadata field to StructField case class in spark.
>
> case class StructField(name: String)
>
> And how to carry over the metadata in query execution.
>
>
>
>
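
For illustration, a minimal sketch of attaching metadata through the existing field, without changing the StructField case class; the key and value used here are arbitrary:

  import org.apache.spark.sql.types.{MetadataBuilder, StringType, StructField, StructType}

  val md = new MetadataBuilder().putString("comment", "customer display name").build()
  val schema = StructType(Seq(StructField("name", StringType, nullable = true, metadata = md)))

  // the metadata can also be attached to a projected column and read back later:
  //   df.select(df("name").as("name", md))
  //   df.schema("name").metadata.getString("comment")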


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Since the data.length is variable, I am not sure whether mixing data.length
and the index makes sense.

Can you describe your use case in bit more detail ?

Thanks

On Tue, Jun 28, 2016 at 11:34 AM, Punit Naik <naik.puni...@gmail.com> wrote:

> Hi Ted
>
> So would the tuple look like: (x._1, split.startIndex + x._2 +
> x._1.length) ?
>
> On Tue, Jun 28, 2016 at 11:09 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Please take a look at:
>> core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala
>>
>> In compute() method:
>> val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
>> firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
>>   (x._1, split.startIndex + x._2)
>>
>> You can modify the second component of the tuple to take data.length
>> into account.
>>
>> On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik <naik.puni...@gmail.com>
>> wrote:
>>
>>> Hi
>>>
>>> I wanted to change the functioning of the "zipWithIndex" function for
>>> spark RDDs in which the output of the function is, just for an example,
>>>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>>>
>>> How can I do this?
>>>
>>> --
>>> Thank You
>>>
>>> Regards
>>>
>>> Punit Naik
>>>
>>
>>
>
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>
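
If changing ZippedWithIndexRDD itself is not desirable, one way to realize this kind of length-based index purely in user code is to pre-compute each partition's total length and then keep a running offset inside each partition. A sketch, assuming an RDD[String] where "length" means string length:

  val rdd = sc.parallelize(Seq("spark", "rdd", "index"), 2)   // example input

  // total length per partition, collected to the driver (one Long per partition)
  val perPartition = rdd.mapPartitionsWithIndex { (i, it) =>
    Iterator((i, it.map(_.length.toLong).sum))
  }.collect().toMap

  // starting offset of a partition = sum of the lengths of all earlier partitions
  val offsets = (0 until rdd.partitions.length)
    .scanLeft(0L)((acc, i) => acc + perPartition.getOrElse(i, 0L))

  // emit (data, previousOffset + data.length), mirroring "(data, prev_index + data.length)"
  val indexed = rdd.mapPartitionsWithIndex { (i, it) =>
    var acc = offsets(i)
    it.map { s => acc += s.length; (s, acc) }
  }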


Re: Modify the functioning of zipWithIndex function for RDDs

2016-06-28 Thread Ted Yu
Please take a look at:
core/src/main/scala/org/apache/spark/rdd/ZippedWithIndexRDD.scala

In compute() method:
val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
  (x._1, split.startIndex + x._2)

You can modify the second component of the tuple to take data.length into
account.

On Tue, Jun 28, 2016 at 10:31 AM, Punit Naik  wrote:

> Hi
>
> I wanted to change the functioning of the "zipWithIndex" function for
> spark RDDs in which the output of the function is, just for an example,
>  "(data, prev_index+data.length)" instead of "(data,prev_index+1)".
>
> How can I do this?
>
> --
> Thank You
>
> Regards
>
> Punit Naik
>


Re: Spark master shuts down when one of zookeeper dies

2016-06-28 Thread Ted Yu
Please see some blog w.r.t. the number of nodes in the quorum:

http://stackoverflow.com/questions/13022244/zookeeper-reliability-three-versus-five-nodes

http://www.ibm.com/developerworks/library/bd-zookeeper/
  the paragraph starting with 'A quorum is represented by a strict majority
of nodes'

FYI

On Tue, Jun 28, 2016 at 5:52 AM, vimal dinakaran 
wrote:

> I am using zookeeper for providing HA for spark cluster.  We have two
> nodes zookeeper cluster.
>
> When one of the zookeeper dies then the entire spark cluster goes down .
>
> Is this expected behaviour ?
> Am I missing something in config ?
>
> Spark version - 1.6.1.
> Zookeeper version - 3.4.6
> // spark-env.sh
> SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER
> -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181"
>
> Below is the log from spark master:
> ZooKeeperLeaderElectionAgent: We have lost leadership
> 16/06/27 09:39:30 ERROR Master: Leadership has been revoked -- master
> shutting down.
>
> Thanks
> Vimal
>
>
>
>


Re: Utils and Logging cannot be accessed in package ....

2016-06-27 Thread Ted Yu
AFAICT Utils is private:

private[spark] object Utils extends Logging {

So is Logging:

private[spark] trait Logging {

FYI

On Mon, Jun 27, 2016 at 8:20 AM, Paolo Patierno  wrote:

> Hello,
>
> I'm trying to use the Utils.createTempDir() method by importing
> org.apache.spark.util.Utils but the Scala compiler tells me that:
>
> object Utils in package util cannot be accessed in package
> org.apache.spark.util
>
> I'm facing the same problem with Logging.
>
> My sbt file has following dependency :
>
> "org.apache.spark" %% "spark-core" % sparkVersion.value % "provided"
> classifier "tests"
>
> where spark version is "2.0.0-SNAPSHOT".
>
> Any ideas about this problem ?
>
> Thanks,
> Paolo.
>
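
Since both are private[spark], a common workaround is a small logging trait of your own on top of SLF4J, which is already on Spark's classpath; a minimal sketch (for Utils.createTempDir, java.nio.file.Files.createTempDirectory is a stand-in):

  import org.slf4j.{Logger, LoggerFactory}

  trait LazyLogging {
    // @transient + lazy so the logger is not serialized with closures and is
    // re-created lazily on each executor when first used
    @transient protected lazy val log: Logger = LoggerFactory.getLogger(getClass)
  }

  class MyHelper extends LazyLogging {
    def doWork(): Unit = log.info("doing work")
    def tempDir() = java.nio.file.Files.createTempDirectory("spark-test").toFile
  }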


Re: Arrays in Datasets (1.6.1)

2016-06-27 Thread Ted Yu
Can you show the stack trace for encoding error(s) ?

Have you looked at the following test which involves NestedArray of
primitive type ?

./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala

Cheers

On Mon, Jun 27, 2016 at 8:50 AM, Daniel Imberman 
wrote:

> Hi all,
>
> So I've been attempting to reformat a project I'm working on to use the
> Dataset API and have been having some issues with encoding errors. From
> what I've read, I think that I should be able to store Arrays of primitive
> values in a dataset. However, the following class gives me encoding errors:
>
> case class InvertedIndex(partition:Int, docs:Array[Int],
> indices:Array[Long], weights:Array[Double])
>
> val inv = RDD[InvertedIndex]
> val invertedIndexDataset = sqlContext.createDataset(inv)
> invertedIndexDataset.groupBy(x => x.partition).mapGroups {
> //...
> }
>
> Could someone please help me understand what the issue is here? Can
> Datasets not currently handle Arrays of primitives, or is there something
> extra that I need to do to make them work?
>
> Thank you
>
>


Re: Logging trait in Spark 2.0

2016-06-24 Thread Ted Yu
See this related thread:

http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+

On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno  wrote:

> Hi,
>
> developing a Spark Streaming custom receiver I noticed that the Logging
> trait isn't accessible anymore in Spark 2.0.
>
> trait Logging in package internal cannot be accessed in package
> org.apache.spark.internal
>
> For developing a custom receiver what is the preferred way for logging ?
> Just using log4j dependency as any other Java/Scala library/application ?
>
> Thanks,
> Paolo
>
> *Paolo Patierno*
>
> *Senior Software Engineer (IoT) @ Red Hat**Microsoft MVP on **Windows
> Embedded & IoT*
> *Microsoft Azure Advisor*
>
> Twitter : @ppatierno 
> Linkedin : paolopatierno 
> Blog : DevExperience 
>


Re: DataFrame versus Dataset creation and usage

2016-06-24 Thread Ted Yu
In Spark 2.0, Dataset and DataFrame are unified.

Would this simplify your use case ?

On Fri, Jun 24, 2016 at 7:27 AM, Martin Serrano  wrote:

> Hi,
>
> I'm exposing a custom source to the Spark environment.  I have a question
> about the best way to approach this problem.
>
> I created a custom relation for my source and it creates a
> DataFrame.  My custom source knows the data types which are *dynamic*
> so this seemed to be the appropriate return type.  This works fine.
>
> The next step I want to take is to expose some custom mapping functions
> (written in Java).  But when I look at the APIs, the map method for
> DataFrame returns an RDD (not a DataFrame).  (Should I use
> SqlContext.createDataFrame on the result? -- does this result in additional
> processing overhead?)  The Dataset type seems to be more of what I'd be
> looking for; its map method returns the Dataset type.  So chaining them
> together is a natural exercise.
>
> But to create the Dataset from a DataFrame, it appears that I have to
> provide the types of each field in the Row in the DataFrame.as[...]
> method.  I would think that the DataFrame would be able to do this
> automatically since it has all the types already.
>
> This leads me to wonder how I should be approaching this effort.  As all
> the fields and types are dynamic, I cannot use beans as my type when
> passing data around.  Any advice would be appreciated.
>
> Thanks,
> Martin
>
>
>
>
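
To make the unification concrete: in 2.0, DataFrame is just an alias for Dataset[Row], so
map stays in the Dataset world instead of dropping down to an RDD. A small sketch (the
source format and the column accessed are placeholders):

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

val spark = SparkSession.builder().appName("custom-source-demo").getOrCreate()
import spark.implicits._

val df: DataFrame = spark.read.format("my.custom.source").load()   // hypothetical source

// The tuple encoder comes from spark.implicits._; the result is a Dataset, not an RDD.
val mapped: Dataset[(String, Int)] =
  df.map(row => (row.getString(0), row.getString(0).length))
mapped.show()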


Re: Kryo ClassCastException during Serialization/deserialization in Spark Streaming

2016-06-23 Thread Ted Yu
Can you illustrate how sampleMap is populated ?

Thanks

On Thu, Jun 23, 2016 at 12:34 PM, SRK  wrote:

> Hi,
>
> I keep getting the following error in my Spark Streaming every now and then
> after the  job runs for say around 10 hours. I have those 2 classes
> registered in kryo as shown below.  sampleMap is a field in SampleSession
> as shown below. Any suggestion as to how to avoid this would be of great
> help!!
>
> public class SampleSession implements Serializable, Cloneable{
>private Map sampleMap;
> }
>
>  sparkConf.registerKryoClasses(Array( classOf[SampleSession],
> classOf[Sample]))
>
>
>
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException:
> com.test.Sample cannot be cast to java.lang.String
> Serialization trace:
> sampleMap (com.test.SampleSession)
> at
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:585)
> at
>
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> com.twitter.chill.Tuple6Serializer.write(TupleSerializers.scala:96)
> at
> com.twitter.chill.Tuple6Serializer.write(TupleSerializers.scala:93)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:37)
> at
> com.twitter.chill.Tuple2Serializer.write(TupleSerializers.scala:33)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
> at
>
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:158)
> at
>
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:153)
> at
>
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1190)
> at
>
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1199)
> at
> org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:132)
> at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:793)
> at
> org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
> at
> org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:175)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Caused by: java.lang.ClassCastException: com.test.Sample cannot be cast to
> java.lang.String
> at
>
> com.esotericsoftware.kryo.serializers.DefaultSerializers$StringSerializer.write(DefaultSerializers.java:146)
> at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
> at
>
> com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:82)
> at
>
> com.esotericsoftware.kryo.serializers.MapSerializer.write(MapSerializer.java:17)
> at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
> at
>
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
> ... 37 more
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-ClassCastException-during-Serialization-deserialization-in-Spark-Streaming-tp27219.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> 
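
One mitigation worth trying (a sketch, not a confirmed fix for this particular trace):
register the concrete map class used for sampleMap alongside the element classes, so Kryo
serializes the map contents without relying on the declared generic types.

import org.apache.spark.SparkConf

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // SampleSession and Sample are the application classes from the report above.
  .registerKryoClasses(Array(
    classOf[SampleSession],
    classOf[Sample],
    classOf[java.util.HashMap[_, _]]
  ))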

Re: Multiple compute nodes in standalone mode

2016-06-23 Thread Ted Yu
Have you looked at:

https://spark.apache.org/docs/latest/spark-standalone.html

On Thu, Jun 23, 2016 at 12:28 PM, avendaon  wrote:

> Hi all,
>
> I have a cluster that has multiple nodes, and the data partition is
> unified,
> therefore all my nodes in my computer can access to the data I am working
> on. Right now, I run Spark in a single node, and it work beautifully.
>
> My question is, Is it possible to run Spark using multiple compute nodes
> (as
> a standalone mode, I don't have HDFS/Hadoop installed)? If so, what do I
> have to add/change to my Spark version or Spark script (either python or
> scala)?
>
> Thanks,
>
> Jose
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Multiple-compute-nodes-in-standalone-mode-tp27218.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: NullPointerException when starting StreamingContext

2016-06-22 Thread Ted Yu
Which Scala version / Spark release are you using ?

Cheers

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(ConfigFactory.load())
>   LogHandler.log.info("createDataStreamFromKafka")
>   KafkaUtils.createDirectStream[String,
> Array[Byte],
> StringDecoder,
> DefaultDecoder](
> ssc,
> Map[String, String]("metadata.broker.list" -> 
> 

Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
The tar ball was built against Hadoop 2.6, which is compatible with Hadoop 2.7.2,
so the answer should be yes.

On Wed, Jun 22, 2016 at 7:10 PM, 喜之郎 <251922...@qq.com> wrote:

> Thanks.
>
> In addition, I want to know if I can use spark-1.6.1-bin-hadoop2.6.tgz
> <http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz>, which
> is a pre-built package, on Hadoop 2.7.2?
>
>
>
> ------ Original Message ------
> *From:* "Ted Yu" <yuzhih...@gmail.com>
> *Sent:* Wednesday, June 22, 2016, 11:51 PM
> *To:* "喜之郎" <251922...@qq.com>
> *Cc:* "user" <user@spark.apache.org>
> *Subject:* Re: spark-1.6.1-bin-without-hadoop can not use spark-sql
>
> build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
> -Dhadoop.version=2.7.2 package
>
> On Wed, Jun 22, 2016 at 8:00 AM, 251922566 <251922...@qq.com> wrote:
>
>> OK, I will rebuild it myself. If I want to use Spark with Hadoop 2.7.2, what
>> should I put for the hadoop version parameter when I build Spark: 2.7.2 or something else?
>>
>> Sent from my Huawei phone
>>
>>
>>  Original Message 
>> Subject: Re: spark-1.6.1-bin-without-hadoop can not use spark-sql
>> From: Ted Yu
>> To: 喜之郎 <251922...@qq.com>
>> Cc: user
>>
>>
>> I wonder if the tar ball was built with:
>>
>> -Phive -Phive-thriftserver
>>
>> Maybe rebuild by yourself with the above ?
>>
>> FYI
>>
>> On Wed, Jun 22, 2016 at 4:38 AM, 喜之郎 <251922...@qq.com> wrote:
>>
>>> Hi all.
>>> I downloaded spark-1.6.1-bin-without-hadoop.tgz
>>> <http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-without-hadoop.tgz>
>>> from the website.
>>> And I configured "SPARK_DIST_CLASSPATH" in spark-env.sh.
>>> Now spark-shell runs well, but spark-sql cannot run.
>>> My hadoop version is 2.7.2.
>>> Here is the error info:
>>>
>>> bin/spark-sql
>>> java.lang.ClassNotFoundException:
>>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> at java.lang.Class.forName0(Native Method)
>>> at java.lang.Class.forName(Class.java:278)
>>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>> Failed to load main class
>>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
>>> You need to build Spark with -Phive and -Phive-thriftserver.
>>>
>>> Do I need configure something else in spark-env.sh or spark-default.conf?
>>> Suggestions are appreciated ,thanks.
>>>
>>>
>>>
>>>
>>
>


Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
See the first example in:

http://www.w3schools.com/sql/sql_func_count.asp

On Wed, Jun 22, 2016 at 9:21 AM, Jakub Dubovsky <
spark.dubovsky.ja...@gmail.com> wrote:

> Hey Ted,
>
> thanks for reacting.
>
> I am refering to both of them. They both take column as parameter
> regardless of its type. Intuition here is that count should take no
> parameter. Or am I missing something?
>
> Jakub
>
> On Wed, Jun 22, 2016 at 6:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Are you referring to the following method in
>> sql/core/src/main/scala/org/apache/spark/sql/functions.scala :
>>
>>   def count(e: Column): Column = withAggregateFunction {
>>
>> Did you notice this method ?
>>
>>   def count(columnName: String): TypedColumn[Any, Long] =
>>
>> On Wed, Jun 22, 2016 at 9:06 AM, Jakub Dubovsky <
>> spark.dubovsky.ja...@gmail.com> wrote:
>>
>>> Hey sparkers,
>>>
>>> an aggregate function *count* in *org.apache.spark.sql.functions*
>>> package takes a *column* as an argument. Is this needed for something?
>>> I find it confusing that I need to supply a column there. It feels like it
>>> might be distinct count or something. This can be seen in latest
>>> documentation
>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
>>> .
>>>
>>> I am considering filling this in spark bug tracker. Any opinions on this?
>>>
>>> Thanks
>>>
>>> Jakub
>>>
>>>
>>
>
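
The SQL semantics carry over directly: count(someColumn) counts non-null values of that
column, while count(lit(1)) (or groupBy().count()) counts rows. A quick sketch on the
1.6-era spark-shell:

import org.apache.spark.sql.functions.{count, lit}
import sqlContext.implicits._

val df = Seq(("a", Some(1)), ("b", None: Option[Int])).toDF("k", "v")

// count(v) skips the null in row "b"; count(1) counts every row.
df.agg(count($"v"), count(lit(1))).show()
// expected: count(v) = 1, count(1) = 2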


Re: Confusing argument of sql.functions.count

2016-06-22 Thread Ted Yu
Are you referring to the following method in
sql/core/src/main/scala/org/apache/spark/sql/functions.scala :

  def count(e: Column): Column = withAggregateFunction {

Did you notice this method ?

  def count(columnName: String): TypedColumn[Any, Long] =

On Wed, Jun 22, 2016 at 9:06 AM, Jakub Dubovsky <
spark.dubovsky.ja...@gmail.com> wrote:

> Hey sparkers,
>
> an aggregate function *count* in *org.apache.spark.sql.functions* package
> takes a *column* as an argument. Is this needed for something? I find it
> confusing that I need to supply a column there. It feels like it might be
> distinct count or something. This can be seen in latest documentation
> 
> .
>
> I am considering filling this in spark bug tracker. Any opinions on this?
>
> Thanks
>
> Jakub
>
>


Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
build/mvn clean -Phive -Phive-thriftserver -Pyarn -Phadoop-2.6 -Psparkr
-Dhadoop.version=2.7.2 package

On Wed, Jun 22, 2016 at 8:00 AM, 251922566 <251922...@qq.com> wrote:

> OK, I will rebuild it myself. If I want to use Spark with Hadoop 2.7.2, what
> should I put for the hadoop version parameter when I build Spark: 2.7.2 or something else?
>
> Sent from my Huawei phone
>
>
>  Original Message 
> Subject: Re: spark-1.6.1-bin-without-hadoop can not use spark-sql
> From: Ted Yu
> To: 喜之郎 <251922...@qq.com>
> Cc: user
>
>
> I wonder if the tar ball was built with:
>
> -Phive -Phive-thriftserver
>
> Maybe rebuild by yourself with the above ?
>
> FYI
>
> On Wed, Jun 22, 2016 at 4:38 AM, 喜之郎 <251922...@qq.com> wrote:
>
>> Hi all.
>> I downloaded spark-1.6.1-bin-without-hadoop.tgz
>> <http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-without-hadoop.tgz>
>> from the website.
>> And I configured "SPARK_DIST_CLASSPATH" in spark-env.sh.
>> Now spark-shell runs well, but spark-sql cannot run.
>> My hadoop version is 2.7.2.
>> Here is the error info:
>>
>> bin/spark-sql
>> java.lang.ClassNotFoundException:
>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> at java.lang.Class.forName0(Native Method)
>> at java.lang.Class.forName(Class.java:278)
>> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> Failed to load main class
>> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
>> You need to build Spark with -Phive and -Phive-thriftserver.
>>
>> Do I need configure something else in spark-env.sh or spark-default.conf?
>> Suggestions are appreciated ,thanks.
>>
>>
>>
>>
>


Re: spark-1.6.1-bin-without-hadoop can not use spark-sql

2016-06-22 Thread Ted Yu
I wonder if the tar ball was built with:

-Phive -Phive-thriftserver

Maybe rebuild by yourself with the above ?

FYI

On Wed, Jun 22, 2016 at 4:38 AM, 喜之郎 <251922...@qq.com> wrote:

> Hi all.
> I downloaded spark-1.6.1-bin-without-hadoop.tgz from
> the website.
> And I configured "SPARK_DIST_CLASSPATH" in spark-env.sh.
> Now spark-shell runs well, but spark-sql cannot run.
> My hadoop version is 2.7.2.
> Here is the error info:
>
> bin/spark-sql
> java.lang.ClassNotFoundException:
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:278)
> at org.apache.spark.util.Utils$.classForName(Utils.scala:174)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:689)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Failed to load main class
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.
> You need to build Spark with -Phive and -Phive-thriftserver.
>
> Do I need configure something else in spark-env.sh or spark-default.conf?
> Suggestions are appreciated ,thanks.
>
>
>
>


Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Ted Yu
For the run which returned incorrect result, did you observe any error (on
workers) ?

Cheers

On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel  wrote:

> I have an RDD[String, MyObj] which is a result of Join + Map operation. It
> has no partitioner info. I run reduceByKey without passing any Partitioner
> or partition counts.  I observed that output aggregation result for given
> key is incorrect sometime. like 1 out of 5 times. It looks like reduce
> operation is joining values from two different keys. There is no
> configuration change between multiple runs. I am scratching my head over
> this. I verified results by printing out RDD before and after reduce
> operation; collecting subset at driver.
>
> Besides shuffle and storage memory fraction I use following options:
>
> sparkConf.set("spark.driver.userClassPathFirst","true")
> sparkConf.set("spark.unsafe.offHeap","true")
> sparkConf.set("spark.reducer.maxSizeInFlight","128m")
> sparkConf.set("spark.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>
>
>
> [image: What's New with Xactly] 
>
>   [image: LinkedIn]
>   [image: Twitter]
>   [image: Facebook]
>   [image: YouTube]
> 


Re: scala.NotImplementedError: put() should not be called on an EmptyStateMap while doing stateful computation on spark streaming

2016-06-21 Thread Ted Yu
Are you using 1.6.1 ?

If not, does the problem persist when you use 1.6.1 ?

Thanks

> On Jun 20, 2016, at 11:16 PM, umanga  wrote:
> 
> I am getting following warning while running stateful computation. The state
> consists of BloomFilter (stream-lib) as Value and Integer as key.
> 
> The program runs smoothly for few minutes and after that, i am getting this
> warning, and streaming app becomes unstable (processing time increases
> exponentially), and ultimately job fails.
> 
> 
> WARN TaskSetManager: Lost task 0.0 in stage 144.0 (TID 326, mesos-slave-02):
> scala.NotImplementedError: put() should not be called on an EmptyStateMap
>at org.apache.spark.streaming.util.EmptyStateMap.put(StateMap.scala:73)
>at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:62)
>at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>at
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:155)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>at java.lang.Thread.run(Thread.java:745)
> 
> I am using kryo serialization. From somewhere in internet, I am getting hint
> that this may be due to kryo serialization error for
> OpenHashMapBasedStateMap. But, I have no idea how to fix this.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/scala-NotImplementedError-put-should-not-be-called-on-an-EmptyStateMap-while-doing-stateful-computatg-tp27200.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Build Spark 2.0 succeeded but could not run it on YARN

2016-06-20 Thread Ted Yu
What operations did you run in the Spark shell ?

It would be easier for other people to reproduce using your code snippet.

Thanks

On Mon, Jun 20, 2016 at 6:20 PM, Jeff Zhang  wrote:

> Could you check the yarn app logs for details ? run command "yarn logs
> -applicationId " to get the yarn log
>
> On Tue, Jun 21, 2016 at 9:18 AM, wgtmac  wrote:
>
>> I ran into problems in building Spark 2.0. The build process actually
>> succeeded but when I uploaded to cluster and launched the Spark shell on
>> YARN, it reported following exceptions again and again:
>>
>> 16/06/17 03:32:00 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint:
>> Container marked as failed: container_e437_1464601161543_1582846_01_13
>> on host: hadoopworker575-sjc1.. Exit status: 1.
>> Diagnostics:
>> Exception from container-launch.
>> Container id: container_e437_1464601161543_1582846_01_13
>> Exit code: 1
>> Stack trace: ExitCodeException exitCode=1:
>> at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>> at org.apache.hadoop.util.Shell.run(Shell.java:455)
>> at
>> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
>> at
>>
>> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Container exited with a non-zero exit code 1
>>
>> =
>> Build command:
>>
>> export JAVA_HOME=   // tried both java7 and java8
>> ./dev/change-scala-version.sh 2.11   // tried both 2.10 and 2.11
>> ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive
>> -Phive-thriftserver -DskipTests clean package
>>
>> The 2.0.0-preview version downloaded from Spark website works well so it
>> is
>> not the problem of my cluster. Also I can make it to build Spark 1.5 and
>> 1.6
>> and run them on the cluster. But in Spark 2.0, I failed both 2.0.0-preview
>> tag and 2.0.0-SNAPSHOT. Anyone has any idea? Thanks!
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Build-Spark-2-0-succeeded-but-could-not-run-it-on-YARN-tp27199.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Re: Accessing system environment on Spark Worker

2016-06-19 Thread Ted Yu
Have you looked at http://spark.apache.org/docs/latest/ec2-scripts.html ?

There is description on setting AWS_SECRET_ACCESS_KEY.

On Sun, Jun 19, 2016 at 4:46 AM, Mohamed Taher AlRefaie 
wrote:

> Hello all:
>
> I have an application that requires accessing DynamoDB tables. Each worker
> establishes a connection with the database on its own.
>
> I have added both `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to both
> master's and workers `spark-env.sh` file. I have also run the file using
> `sh` to make sure the variables are exported.
>
> When the code runs, I always get the error:
>
> Caused by: com.amazonaws.AmazonClientException: Unable to load AWS
> credentials from any provider in the chain
> at
> com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:131)
> at
> com.amazonaws.http.AmazonHttpClient.getCredentialsFromContext(AmazonHttpClient.java:774)
> at
> com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:800)
> at
> com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:695)
> at
> com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:447)
> at
> com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:409)
> at
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:358)
> at
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.doInvoke(AmazonDynamoDBClient.java:2051)
> at
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.invoke(AmazonDynamoDBClient.java:2021)
> at
> com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient.describeTable(AmazonDynamoDBClient.java:1299)
> at
> com.amazon.titan.diskstorage.dynamodb.DynamoDBDelegate.describeTable(DynamoDBDelegate.java:635)
> ... 27 more
>
> It seems that the AWS SDK has failed to load the credentials even though
> they're exported. When I do export command, it returns the credentials.
> What type of solution should I try?
>
> Thanks,
> Mohamed Taher Alrefaie.
>
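
If exporting the variables in spark-env.sh is not being picked up by the executor JVMs, an
alternative sketch is to forward them explicitly from the driver through SparkConf (this
assumes the driver's environment has both variables set):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamodb-app")
  // Propagate the AWS credentials from the driver environment to every executor.
  .setExecutorEnv("AWS_ACCESS_KEY_ID", sys.env("AWS_ACCESS_KEY_ID"))
  .setExecutorEnv("AWS_SECRET_ACCESS_KEY", sys.env("AWS_SECRET_ACCESS_KEY"))

val sc = new SparkContext(conf)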


Re: How to cause a stage to fail (using spark-shell)?

2016-06-19 Thread Ted Yu
You can utilize a counter in external storage (e.g. a NoSQL store).
When the counter reaches 2, stop throwing the exception so that the task passes.

FYI

On Sun, Jun 19, 2016 at 3:22 AM, Jacek Laskowski  wrote:

> Hi,
>
> Thanks Burak for the idea, but it *only* fails the tasks that
> eventually fail the entire job not a particular stage (just once or
> twice) before the entire job is failed. The idea is to see the
> attempts in web UI as there's a special handling for cases where a
> stage failed once or twice before finishing up properly.
>
> Any ideas? I've got one but it requires quite an extensive cluster set
> up which I'd like to avoid if possible. Just something I could use
> during workshops or demos and others could reproduce easily to learn
> Spark's internals.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sun, Jun 19, 2016 at 5:25 AM, Burak Yavuz  wrote:
> > Hi Jacek,
> >
> > Can't you simply have a mapPartitions task throw an exception or
> something?
> > Are you trying to do something more esoteric?
> >
> > Best,
> > Burak
> >
> > On Sat, Jun 18, 2016 at 5:35 AM, Jacek Laskowski 
> wrote:
> >>
> >> Hi,
> >>
> >> Following up on this question, is a stage considered failed only when
> >> there is a FetchFailed exception? Can I have a failed stage with only
> >> a single-stage job?
> >>
> >> Appreciate any help on this...(as my family doesn't like me spending
> >> the weekend with Spark :))
> >>
> >> Pozdrawiam,
> >> Jacek Laskowski
> >> 
> >> https://medium.com/@jaceklaskowski/
> >> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> Follow me at https://twitter.com/jaceklaskowski
> >>
> >>
> >> On Sat, Jun 18, 2016 at 11:53 AM, Jacek Laskowski 
> wrote:
> >> > Hi,
> >> >
> >> > I'm trying to see some stats about failing stages in web UI and want
> >> > to "create" few failed stages. Is this possible using spark-shell at
> >> > all? Which setup of Spark/spark-shell would allow for such a scenario.
> >> >
> >> > I could write a Scala code if that's the only way to have failing
> >> > stages.
> >> >
> >> > Please guide. Thanks.
> >> >
> >> > /me on to reviewing the Spark code...
> >> >
> >> > Pozdrawiam,
> >> > Jacek Laskowski
> >> > 
> >> > https://medium.com/@jaceklaskowski/
> >> > Mastering Apache Spark http://bit.ly/mastering-apache-spark
> >> > Follow me at https://twitter.com/jaceklaskowski
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
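
A lighter-weight variant of the counter idea, keyed off the task's own attempt number so
the first attempts throw and a later attempt succeeds. Note this exercises task retries in
the web UI; a stage that fails and is resubmitted normally requires a FetchFailedException
from a shuffle. Run it with a master that allows retries, e.g. --master "local[4, 4]" or a
real cluster.

import org.apache.spark.TaskContext

sc.parallelize(1 to 100, 4).mapPartitions { iter =>
  // Fail the first two attempts of every task, succeed from the third attempt on.
  if (TaskContext.get().attemptNumber() < 2) {
    throw new RuntimeException("deliberate failure to exercise task retries")
  }
  iter
}.count()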


Re: Switching broadcast mechanism from torrrent

2016-06-19 Thread Ted Yu
I think good practice is not to hold on to SparkContext in mapFunction.

On Sun, Jun 19, 2016 at 7:10 AM, Takeshi Yamamuro <linguin@gmail.com>
wrote:

> How about using `transient` annotations?
>
> // maropu
>
> On Sun, Jun 19, 2016 at 10:51 PM, Daniel Haviv <
> daniel.ha...@veracity-group.com> wrote:
>
>> Hi,
>> Just updating on my findings for future reference.
>> The problem was that after refactoring my code I ended up with a scala
>> object which held SparkContext as a member, eg:
>> object A  {
>>  sc: SparkContext = new SparkContext
>>  def mapFunction  {}
>> }
>>
>> and when I called rdd.map(A.mapFunction) it failed as A.sc is not
>> serializable.
>>
>> Thanks,
>> Daniel
>>
>> On Tue, Jun 7, 2016 at 10:13 AM, Takeshi Yamamuro <linguin@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> Since `HttpBroadcastFactory` has already been removed in master, so
>>> you cannot use the broadcast mechanism in future releases.
>>>
>>> Anyway, I couldn't find a root cause only from the stacktraces...
>>>
>>> // maropu
>>>
>>>
>>>
>>>
>>> On Mon, Jun 6, 2016 at 2:14 AM, Daniel Haviv <
>>> daniel.ha...@veracity-group.com> wrote:
>>>
>>>> Hi,
>>>> I've set  spark.broadcast.factory to
>>>> org.apache.spark.broadcast.HttpBroadcastFactory and it indeed resolve my
>>>> issue.
>>>>
>>>> I'm creating a dataframe which creates a broadcast variable internally
>>>> and then fails due to the torrent broadcast with the following stacktrace:
>>>> Caused by: org.apache.spark.SparkException: Failed to get
>>>> broadcast_3_piece0 of broadcast_3
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
>>>> at scala.Option.getOrElse(Option.scala:120)
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
>>>> at scala.collection.immutable.List.foreach(List.scala:318)
>>>> at org.apache.spark.broadcast.TorrentBroadcast.org
>>>> $apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
>>>> at
>>>> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
>>>> at
>>>> org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1220)
>>>>
>>>> I'm using spark 1.6.0 on CDH 5.7
>>>>
>>>> Thanks,
>>>> Daniel
>>>>
>>>>
>>>> On Wed, Jun 1, 2016 at 5:52 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>>> I found spark.broadcast.blockSize but no parameter to switch broadcast
>>>>> method.
>>>>>
>>>>> Can you describe the issues with torrent broadcast in more detail ?
>>>>>
>>>>> Which version of Spark are you using ?
>>>>>
>>>>> Thanks
>>>>>
>>>>> On Wed, Jun 1, 2016 at 7:48 AM, Daniel Haviv <
>>>>> daniel.ha...@veracity-group.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Our application is failing due to issues with the torrent broadcast,
>>>>>> is there a way to switch to another broadcast method ?
>>>>>>
>>>>>> Thank you.
>>>>>> Daniel
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
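
For future readers, a compact sketch of the shape Daniel describes and one way to
restructure it so the task closure no longer drags the SparkContext along:

import org.apache.spark.{SparkConf, SparkContext}

// Problematic shape: the object owning the map function also owns the SparkContext, so
// referencing A.mapFunction can pull the whole non-serializable object into the closure.
object A {
  val sc: SparkContext = new SparkContext(new SparkConf().setAppName("demo"))
  def mapFunction(s: String): Int = s.length
}

// Restructured: keep pure functions in an object with no SparkContext member
// (or mark such members @transient).
object Transformations extends Serializable {
  def mapFunction(s: String): Int = s.length
}

// A.sc.textFile("input.txt").map(Transformations.mapFunction)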


Re: Dataset Select Function after Aggregate Error

2016-06-18 Thread Ted Yu
scala> ds.groupBy($"_1").count.select(expr("_1").as[String],
expr("count").as[Long])
res0: org.apache.spark.sql.Dataset[(String, Long)] = [_1: int, count:
bigint]

scala> ds.groupBy($"_1").count.select(expr("_1").as[String],
expr("count").as[Long]).show
+---+-+
| _1|count|
+---+-+
|  1|1|
|  2|1|
+---+-+

On Sat, Jun 18, 2016 at 8:29 AM, Pedro Rodriguez 
wrote:

> I am curious if there is a way to call this so that it becomes a compile
> error rather than runtime error:
>
> // Note mispelled count and name
> ds.groupBy($"name").count.select('nam, $"coun").show
>
> More specifically, what are the best type safety guarantees that Datasets
> provide? It seems like with Dataframes there is still the unsafety of
> specifying column names by string/symbol and expecting the type to be
> correct and exist, but if you do something like this then downstream code
> is safer:
>
> // This is Array[(String, Long)] instead of Array[sql.Row]
> ds.groupBy($"name").count.select('name.as[String], 'count.as
> [Long]).collect()
>
> Does that seem like a correct understanding of Datasets?
>
> On Sat, Jun 18, 2016 at 6:39 AM, Pedro Rodriguez 
> wrote:
>
>> Looks like it was my own fault. I had spark 2.0 cloned/built, but had the
>> spark shell in my path so somehow 1.6.1 was being used instead of 2.0.
>> Thanks
>>
>> On Sat, Jun 18, 2016 at 1:16 AM, Takeshi Yamamuro 
>> wrote:
>>
>>> which version you use?
>>> I passed in 2.0-preview as follows;
>>> ---
>>>
>>> Spark context available as 'sc' (master = local[*], app id =
>>> local-1466234043659).
>>>
>>> Spark session available as 'spark'.
>>>
>>> Welcome to
>>>
>>>     __
>>>
>>>  / __/__  ___ _/ /__
>>>
>>> _\ \/ _ \/ _ `/ __/  '_/
>>>
>>>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0-preview
>>>
>>>   /_/
>>>
>>>
>>>
>>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>>> 1.8.0_31)
>>>
>>> Type in expressions to have them evaluated.
>>>
>>> Type :help for more information.
>>>
>>>
>>> scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
>>>
>>> hive.metastore.schema.verification is not enabled so recording the
>>> schema version 1.2.0
>>>
>>> ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]
>>>
>>> scala> ds.groupBy($"_1").count.select($"_1", $"count").show
>>>
>>> +---+-+
>>>
>>> | _1|count|
>>>
>>> +---+-+
>>>
>>> |  1|1|
>>>
>>> |  2|1|
>>>
>>> +---+-+
>>>
>>>
>>>
>>> On Sat, Jun 18, 2016 at 3:09 PM, Pedro Rodriguez <
>>> ski.rodrig...@gmail.com> wrote:
>>>
 I went ahead and downloaded/compiled Spark 2.0 to try your code snippet
 Takeshi. It unfortunately doesn't compile.

 scala> val ds = Seq[Tuple2[Int, Int]]((1, 0), (2, 0)).toDS
 ds: org.apache.spark.sql.Dataset[(Int, Int)] = [_1: int, _2: int]

 scala> ds.groupBy($"_1").count.select($"_1", $"count").show
 :28: error: type mismatch;
  found   : org.apache.spark.sql.ColumnName
  required: org.apache.spark.sql.TypedColumn[(org.apache.spark.sql.Row,
 Long),?]
   ds.groupBy($"_1").count.select($"_1", $"count").show
  ^

 I also gave a try to Xinh's suggestion using the code snippet below
 (partially from spark docs)
 scala> val ds = Seq(Person("Andy", 32), Person("Andy", 2),
 Person("Pedro", 24), Person("Bob", 42)).toDS()
 scala> ds.groupBy(_.name).count.select($"name".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
 input columns: [];
 scala> ds.groupBy(_.name).count.select($"_1".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve 'name' given
 input columns: [];
 scala> ds.groupBy($"name").count.select($"_1".as[String]).show
 org.apache.spark.sql.AnalysisException: cannot resolve '_1' given input
 columns: [];

 Looks like there are empty columns for some reason, the code below
 works fine for the simple aggregate
 scala> ds.groupBy(_.name).count.show

 Would be great to see an idiomatic example of using aggregates like
 these mixed with spark.sql.functions.

 Pedro

 On Fri, Jun 17, 2016 at 9:59 PM, Pedro Rodriguez <
 ski.rodrig...@gmail.com> wrote:

> Thanks Xinh and Takeshi,
>
> I am trying to avoid map since my impression is that this uses a Scala
> closure so is not optimized as well as doing column-wise operations is.
>
> Looks like the $ notation is the way to go, thanks for the help. Is
> there an explanation of how this works? I imagine it is a method/function
> with its name defined as $ in Scala?
>
> Lastly, are there prelim Spark 2.0 docs? If there isn't a good
> description/guide of using this syntax I would be willing to contribute
> some documentation.
>
> Pedro
>
> 

Re: spark-xml - xml parsing when rows only have attributes

2016-06-17 Thread Ted Yu
Please see https://github.com/databricks/spark-xml/issues/92

On Fri, Jun 17, 2016 at 5:19 AM, VG  wrote:

> I am using spark-xml for loading data and creating a data frame.
>
> If xml element has sub elements and values, then it works fine. Example
>  if the xml element is like
>
> 
>  test
> 
>
> however if the xml element is bare with just attributes, then it does not
> work - Any suggestions.
>   Does not load the data
>
>
>
> Any suggestions to fix this
>
>
>
>
>
>
> On Fri, Jun 17, 2016 at 4:28 PM, Siva A  wrote:
>
>> Use Spark XML version,0.3.3
>> <dependency>
>>     <groupId>com.databricks</groupId>
>>     <artifactId>spark-xml_2.10</artifactId>
>>     <version>0.3.3</version>
>> </dependency>
>>
>> On Fri, Jun 17, 2016 at 4:25 PM, VG  wrote:
>>
>>> Hi Siva
>>>
>>> This is what i have for jars. Did you manage to run with these or
>>> different versions ?
>>>
>>>
>>> <dependency>
>>>     <groupId>org.apache.spark</groupId>
>>>     <artifactId>spark-core_2.10</artifactId>
>>>     <version>1.6.1</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>org.apache.spark</groupId>
>>>     <artifactId>spark-sql_2.10</artifactId>
>>>     <version>1.6.1</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>com.databricks</groupId>
>>>     <artifactId>spark-xml_2.10</artifactId>
>>>     <version>0.2.0</version>
>>> </dependency>
>>> <dependency>
>>>     <groupId>org.scala-lang</groupId>
>>>     <artifactId>scala-library</artifactId>
>>>     <version>2.10.6</version>
>>> </dependency>
>>>
>>> Thanks
>>> VG
>>>
>>>
>>> On Fri, Jun 17, 2016 at 4:16 PM, Siva A 
>>> wrote:
>>>
 Hi Marco,

 I did run in IDE(Intellij) as well. It works fine.
 VG, make sure the right jar is in classpath.

 --Siva

 On Fri, Jun 17, 2016 at 4:11 PM, Marco Mistroni 
 wrote:

> and  your eclipse path is correct?
> i suggest, as Siva did before, to build your jar and run it via
> spark-submit  by specifying the --packages option
> it's as simple as run this command
>
> spark-submit   --packages
> com.databricks:spark-xml_:   --class  of
> your class containing main> 
>
> Indeed, if you have only these lines to run, why dont you try them in
> spark-shell ?
>
> hth
>
> On Fri, Jun 17, 2016 at 11:32 AM, VG  wrote:
>
>> nopes. eclipse.
>>
>>
>> On Fri, Jun 17, 2016 at 3:58 PM, Siva A 
>> wrote:
>>
>>> If you are running from IDE, Are you using Intellij?
>>>
>>> On Fri, Jun 17, 2016 at 3:20 PM, Siva A 
>>> wrote:
>>>
 Can you try to package as a jar and run using spark-submit

 Siva

 On Fri, Jun 17, 2016 at 3:17 PM, VG  wrote:

> I am trying to run from IDE and everything else is working fine.
> I added spark-xml jar and now I ended up into this dependency
>
> 6/06/17 15:15:57 INFO BlockManagerMaster: Registered BlockManager
> Exception in thread "main" *java.lang.NoClassDefFoundError:
> scala/collection/GenTraversableOnce$class*
> at
> org.apache.spark.sql.execution.datasources.CaseInsensitiveMap.(ddl.scala:150)
> at
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:154)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
> at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
> at org.ariba.spark.PostsProcessing.main(PostsProcessing.java:19)
> Caused by:* java.lang.ClassNotFoundException:
> scala.collection.GenTraversableOnce$class*
> at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 5 more
> 16/06/17 15:15:58 INFO SparkContext: Invoking stop() from shutdown
> hook
>
>
>
> On Fri, Jun 17, 2016 at 2:59 PM, Marco Mistroni <
> mmistr...@gmail.com> wrote:
>
>> So you are using spark-submit  or spark-shell?
>>
>> you will need to launch either by passing --packages option (like
>> in the example below for spark-csv). you will need to iknow
>>
>> --packages com.databricks:spark-xml_:> version>
>>
>> hth
>>
>>
>>
>> On Fri, Jun 17, 2016 at 10:20 AM, VG  wrote:
>>
>>> Apologies for that.
>>> I am trying to use spark-xml to load data of a xml file.
>>>
>>> here is the exception
>>>
>>> 16/06/17 14:49:04 INFO BlockManagerMaster: Registered
>>> BlockManager
>>> Exception in thread "main" java.lang.ClassNotFoundException:
>>> Failed to find data source: org.apache.spark.xml. Please find 
>>> packages at
>>> http://spark-packages.org
>>> at
>>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>>> at
>>> 
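
For reference, a sketch of how attribute-only elements are meant to surface once a
spark-xml version that handles them is on the classpath (0.3.3 was suggested above):
attributes become columns named with the attribute prefix, which defaults to "_". The row
tag and file path below are placeholders.

// In spark-shell 1.6.x, sqlContext is already in scope.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "row")            // hypothetical: the element that represents one record
  .option("attributePrefix", "_")     // the default prefix, shown explicitly for clarity
  .load("books.xml")

df.printSchema()   // an attribute id="..." is expected to appear as a column named "_id"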

Re: Spark jobs without a login

2016-06-16 Thread Ted Yu
Can you describe more about the container ?

Please show complete stack trace for the exception.

Thanks

On Thu, Jun 16, 2016 at 1:32 PM, jay vyas 
wrote:

> Hi spark:
>
> Is it possible to avoid reliance on a login user when running a spark job?
>
> I'm running out a container that doesnt supply a valid user name,
> and thus, I'm getting the following exception:
>
>
> org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:675)
>
> I'm not too worries about this - but it seems like it might be nice if
> maybe we could specify a user name as part of sparks context or as part of
> an external parameter rather then having to
>  use the java based user/group extractor.
>
>
>
> --
> jay vyas
>


Re: Kerberos setup in Apache spark connecting to remote HDFS/Yarn

2016-06-16 Thread Ted Yu
bq. Caused by: KrbException: Cannot locate default realm

Can you show the rest of the stack trace ?

What versions of Spark / Hadoop are you using ?

Which version of Java are you using (local and in cluster) ?

Thanks

On Thu, Jun 16, 2016 at 6:32 AM, akhandeshi  wrote:

> I am trying to setup my IDE to a scala spark application.  I want to access
> HDFS files from remote Hadoop server that has Kerberos enabled.  My
> understanding is I should be able to do that from Spark.  Here is my code
> so
> far:
>
> val sparkConf = new SparkConf().setAppName(appName).setMaster(master);
>
> if(jars.length>0) {
> sparkConf.setJars(jars);
> }
>
> if(!properties.isEmpty) {
> //val iter = properties.keys.iterator
> for((k,v)<-properties)
> sparkConf.set(k, v);
> } else {
> sparkConf
> .set("spark.executor.memory", "1024m")
> .set("spark.cores.max", "1")
> .set("spark.default.parallelism", "4");
> }
>
> try {
> if(!StringUtils.isBlank(principal) &&
> !StringUtils.isBlank(keytab)) {
> //UserGroupInformation.setConfiguration(config);
>
> UserGroupInformation.loginUserFromKeytab(principal, keytab);
> }
> } catch  {
>   case ioe:IOException =>{
> println("Failed to login to Hadoop [principal = "
> + principal + ", keytab
> = " + keytab + "]");
> ioe.printStackTrace();}
> }
>  val sc = new SparkContext(sparkConf)
>val MY_FILE: String =
> "hdfs://remoteserver:port/file.out"
>val rDD = sc.textFile(MY_FILE,10)
>println("Lines "+rDD.count);
>
> I have core-site.xml in my classpath.  I changed hadoop.ssl.enabled to
> false
> as it was expecting a secret key.  The principal I am using is correct.  I
> tried username/_HOST@fully.qualified.domain and
> username@fully.qualified.domain with no success.  I tried running spark in
> local mode and yarn client mode.   I am hoping someone has a recipe/solved
> this problem.  Any pointers to help setup/debug this problem will be
> helpful.
>
> I am getting following error message:
>
> Caused by: java.lang.IllegalArgumentException: Can't get Kerberos realm
> at
>
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:65)
> at
>
> org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:227)
> at
>
> org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:249)
> at
> org.apache.spark.examples.SparkYarn$.launchClient(SparkYarn.scala:55)
> at org.apache.spark.examples.SparkYarn$.main(SparkYarn.scala:83)
> at org.apache.spark.examples.SparkYarn.main(SparkYarn.scala)
> ... 6 more
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
>
> org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:75)
> at
>
> org.apache.hadoop.security.HadoopKerberosName.setConfiguration(HadoopKerberosName.java:63)
> ... 11 more
> Caused by: KrbException: Cannot locate default realm
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Kerberos-setup-in-Apache-spark-connecting-to-remote-HDFS-Yarn-tp27181.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
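
A sketch of the two pieces most often missing in this situation: telling the local JVM
where krb5.conf lives (the "Cannot locate default realm" symptom) and telling Hadoop's
UserGroupInformation that Kerberos is in use before the keytab login. The paths and
principal below are placeholders.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation

// 1) Make the Kerberos configuration visible to the local JVM
//    (equivalently, pass -Djava.security.krb5.conf=/etc/krb5.conf on the command line).
System.setProperty("java.security.krb5.conf", "/etc/krb5.conf")

// 2) Switch Hadoop security to Kerberos before logging in from the keytab.
val hadoopConf = new Configuration()
hadoopConf.set("hadoop.security.authentication", "kerberos")
UserGroupInformation.setConfiguration(hadoopConf)
UserGroupInformation.loginUserFromKeytab("user@EXAMPLE.COM", "/path/to/user.keytab")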


Re: Reporting warnings from workers

2016-06-15 Thread Ted Yu
Have you looked at:

https://spark.apache.org/docs/latest/programming-guide.html#accumulators

On Wed, Jun 15, 2016 at 1:24 PM, Mathieu Longtin 
wrote:

> Is there a way to report warnings from the workers back to the driver
> process?
>
> Let's say I have an RDD and do this:
>
> newrdd = rdd.map(somefunction)
>
> In *somefunction*, I want to catch when there are invalid values in *rdd *and
> either put them in another RDD or send some sort of message back.
>
> Is that possible?
> --
> Mathieu Longtin
> 1-514-803-8977
>
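
Two sketches of the usual patterns on the 1.6-era API; isValid and transform below are
placeholders for your own logic. The first surfaces a count of bad records on the driver
via an accumulator, the second keeps the offending values themselves in a separate RDD.

// Placeholder validation/transformation logic and input; substitute your own.
def isValid(s: String): Boolean = s.nonEmpty
def transform(s: String): Int = s.length
val rdd = sc.parallelize(Seq("spark", "", "streaming"))

// 1) Count the bad records with an accumulator.
val badCount = sc.accumulator(0L, "invalid records")
val cleaned = rdd.flatMap { x =>
  if (isValid(x)) Some(transform(x))
  else { badCount += 1L; None }
}
cleaned.count()                                    // run an action so the accumulator fills
println(s"Invalid records: ${badCount.value}")

// 2) Keep the bad values by tagging each record and splitting afterwards.
val tagged = rdd.map(x => if (isValid(x)) Right(transform(x)) else Left(x)).cache()
val good = tagged.collect { case Right(v) => v }   // RDD of transformed values
val bad  = tagged.collect { case Left(x)  => x }   // RDD of rejected inputs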


Re: Spark 2.0 release date

2016-06-15 Thread Ted Yu
Andy:
You should sense the tone in Mich's response.

To my knowledge, there hasn't been an RC for the 2.0 release yet.
Once we have an RC, it goes through the normal voting process.

FYI

On Wed, Jun 15, 2016 at 7:38 AM, andy petrella 
wrote:

> > tomorrow lunch time
> Which TZ :-) → I'm working on the update of some materials that Dean
> Wampler and myself will give tomorrow at Scala Days
> 
>  (well
> tomorrow CEST).
>
> Hence, I'm upgrading the materials on spark 2.0.0-preview, do you think
> 2.0.0 will be released before 6PM CEST (9AM PDT)? I don't want to be a joke
> in front of the audience with my almost cutting edge version :-P
>
> tx
>
>
> On Wed, Jun 15, 2016 at 3:59 PM Mich Talebzadeh 
> wrote:
>
>> Tomorrow lunchtime.
>>
>> Btw can you stop spamming every big data forum about good interview
>> questions book for big data!
>>
>> I have seen your mails on this big data book in spark, hive and tez
>> forums and I am sure there are many others. That seems to be the only mail
>> you send around.
>>
>> This forum is for technical discussions not for promotional material.
>> Please confine yourself to technical matters
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 15 June 2016 at 12:45, Chaturvedi Chola 
>> wrote:
>>
>>> when is the spark 2.0 release planned
>>>
>>
>> --
> andy
>


Re: hivecontext error

2016-06-14 Thread Ted Yu
Which release of Spark are you using ?

Can you show the full error trace ?

Thanks

On Tue, Jun 14, 2016 at 6:33 PM, Tejaswini Buche <
tejaswini.buche0...@gmail.com> wrote:

> I am trying to use hivecontext in spark. The following statements are
> running fine :
>
> from pyspark.sql import HiveContext
> sqlContext = HiveContext(sc)
>
> But, when i run the below statement,
>
> sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
>
> I get the following error :
>
> Java Package object not callable
>
> what could be the problem?
> thnx
>


Re: MatchError: StringType

2016-06-14 Thread Ted Yu
Can you give a bit more detail ?

version of Spark

complete error trace

code snippet which reproduces the error

On Tue, Jun 14, 2016 at 9:54 AM, pseudo oduesp 
wrote:

> Hello,
>
> why do I get this error
>
> when using
>
> assembleur =  VectorAssembler(  inputCols=l_CDMVT,
> outputCol="aev"+"CODEM")
> output = assembler.transform(df_aev)
>
> L_CDMTV is the list of columns
>
> thanks ?
> thanks  ?
>


Re: Spark Streaming application failing with Kerberos issue while writing data to HBase

2016-06-13 Thread Ted Yu
Can you show snippet of your code, please ?

Please refer to obtainTokenForHBase() in
yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala

Cheers

On Mon, Jun 13, 2016 at 4:44 AM, Kamesh  wrote:

> Hi All,
>  We are building a spark streaming application and that application writes
> data to HBase table. But writes/reads are failing with following exception
>
> 16/06/13 04:35:16 ERROR ipc.AbstractRpcClient: SASL authentication failed.
> The most likely cause is missing or invalid credentials. Consider 'kinit'.
>
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to
> find any Kerberos tgt)]
>
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> at
> org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
>
> at
> org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:605)
>
> This application is failing at Executor machine. Executor is not able to
> pass the token. Can someone help me how to resolve this issue.
>
> *Environment Details*
> Spark Version : 1.6.1
> HBase Version : 1.0.0
> Hadoop Version : 2.6.0
>
> --
> Thanks & Regards
> Kamesh.
>


Re: Basic question. Access MongoDB data in Spark.

2016-06-13 Thread Ted Yu
Have you considered posting the question on stratio's mailing list ?

You may get faster response there.


On Mon, Jun 13, 2016 at 8:09 AM, Umair Janjua 
wrote:

> Hi guys,
>
> I have this super basic problem which I cannot figure out. Can somebody
> give me a hint.
>
> http://stackoverflow.com/questions/37793214/spark-mongdb-data-using-java
>
> Cheers
>


Re: Spark Getting data from MongoDB in JAVA

2016-06-12 Thread Ted Yu
What's the value of spark.version ?

Do you know which version of Spark mongodb connector 0.10.3 was built
against ?

You can use the following command to find out:
mvn dependency:tree

Maybe the Spark version you use is different from the one the mongodb connector
was built against.

On Fri, Jun 10, 2016 at 2:50 AM, Asfandyar Ashraf Malik <
asfand...@kreditech.com> wrote:

> Hi,
> I did not notice that I put it twice.
> I changed that and ran my program but it still gives the same error:
>
> java.lang.NoSuchMethodError:
> org.apache.spark.sql.catalyst.ScalaReflection$.typeOfObject()Lscala/PartialFunction;
>
>
> Cheers
>
>
>
> 2016-06-10 11:47 GMT+02:00 Alonso Isidoro Roman :
>
>> why *spark-mongodb_2.11 dependency is written twice in pom.xml?*
>>
>> Alonso Isidoro Roman
>> [image: https://]about.me/alonso.isidoro.roman
>>
>> 
>>
>> 2016-06-10 11:39 GMT+02:00 Asfandyar Ashraf Malik <
>> asfand...@kreditech.com>:
>>
>>> Hi,
>>> I am using Stratio library to get MongoDB to work with Spark but I get
>>> the following error:
>>>
>>> java.lang.NoSuchMethodError:
>>> org.apache.spark.sql.catalyst.ScalaReflection
>>>
>>> This is my code.
>>>
>>> ---
>>> public static void main(String[] args) {
>>>
>>>     JavaSparkContext sc = new JavaSparkContext("local[*]", "test spark-mongodb java");
>>>     SQLContext sqlContext = new SQLContext(sc);
>>>
>>>     Map<String, String> options = new HashMap<String, String>();
>>>     options.put("host", "xyz.mongolab.com:59107");
>>>     options.put("database", "heroku_app3525385");
>>>     options.put("collection", "datalog");
>>>     options.put("credentials", "*,,");
>>>
>>>     DataFrame df =
>>>         sqlContext.read().format("com.stratio.datasource.mongodb").options(options).load();
>>>     df.registerTempTable("datalog");
>>>     df.show();
>>>
>>> }
>>>
>>> ---
>>> My pom file is as follows:
>>>
>>> <dependencies>
>>>     <dependency>
>>>         <groupId>org.apache.spark</groupId>
>>>         <artifactId>spark-core_2.11</artifactId>
>>>         <version>${spark.version}</version>
>>>     </dependency>
>>>     <dependency>
>>>         <groupId>org.apache.spark</groupId>
>>>         <artifactId>spark-catalyst_2.11</artifactId>
>>>         <version>${spark.version}</version>
>>>     </dependency>
>>>     <dependency>
>>>         <groupId>org.apache.spark</groupId>
>>>         <artifactId>spark-sql_2.11</artifactId>
>>>         <version>${spark.version}</version>
>>>     </dependency>
>>>     <dependency>
>>>         <groupId>com.stratio.datasource</groupId>
>>>         <artifactId>spark-mongodb_2.11</artifactId>
>>>         <version>0.10.3</version>
>>>     </dependency>
>>>     <dependency>
>>>         <groupId>com.stratio.datasource</groupId>
>>>         <artifactId>spark-mongodb_2.11</artifactId>
>>>         <version>0.10.3</version>
>>>         <type>jar</type>
>>>     </dependency>
>>> </dependencies>
>>>
>>>
>>> Regards
>>>
>>
>>
>


Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Ted Yu
Another source is the presentations given at various conferences.
e.g.
http://www.slideshare.net/databricks/apache-spark-mllib-20-preview-data-science-and-production

FYI

On Sat, Jun 11, 2016 at 8:47 AM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Interesting.
>
> The pace of development in this field is such that practically every
> single book in the Big Data landscape gets out of date before the ink dries on
> it  :)
>
> I concur that they serve as a good reference for starters, but in my opinion
> the best way to learn is to start from the online docs (and these are pretty
> respectable when it comes to Spark) and progress from there.
>
> If you have a certain problem then put it to this group and I am sure someone
> somewhere in this forum has come across it. Also, most of these books'
> authors actively contribute to this mailing list.
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 11 June 2016 at 16:10, Ted Yu <yuzhih...@gmail.com> wrote:
>
>>
>> https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib
>>
>>
>> https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib
>>
>>
>> https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib
>>
>>
>> On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel <deic...@gmail.com> wrote:
>>
>>>
>>> Hey
>>>
>>> Namaskara~Nalama~Guten Tag~Bonjour
>>>
>>> I am a newbie to Machine Learning (MLIB and other libraries on Spark)
>>>
>>> Which would be the best book to learn up?
>>>
>>> Thanks
>>> Deepak
>>>--
>>> Keigu
>>>
>>> Deepak
>>> 73500 12833
>>> www.simtree.net, dee...@simtree.net
>>> deic...@gmail.com
>>>
>>> LinkedIn: www.linkedin.com/in/deicool
>>> Skype: thumsupdeicool
>>> Google talk: deicool
>>> Blog: http://loveandfearless.wordpress.com
>>> Facebook: http://www.facebook.com/deicool
>>>
>>> "Contribute to the world, environment and more :
>>> http://www.gridrepublic.org
>>> "
>>>
>>
>>
>


Re: Book for Machine Learning (MLIB and other libraries on Spark)

2016-06-11 Thread Ted Yu
https://www.amazon.com/Machine-Learning-Spark-Powerful-Algorithms/dp/1783288515/ref=sr_1_1?ie=UTF8=1465657706=8-1=spark+mllib

https://www.amazon.com/Spark-Practical-Machine-Learning-Chinese/dp/7302420424/ref=sr_1_3?ie=UTF8=1465657706=8-3=spark+mllib

https://www.amazon.com/Advanced-Analytics-Spark-Patterns-Learning/dp/1491912766/ref=sr_1_2?ie=UTF8=1465657706=8-2=spark+mllib


On Sat, Jun 11, 2016 at 8:04 AM, Deepak Goel  wrote:

>
> Hey
>
> Namaskara~Nalama~Guten Tag~Bonjour
>
> I am a newbie to Machine Learning (MLIB and other libraries on Spark)
>
> Which would be the best book to learn up?
>
> Thanks
> Deepak
>--
> Keigu
>
> Deepak
> 73500 12833
> www.simtree.net, dee...@simtree.net
> deic...@gmail.com
>
> LinkedIn: www.linkedin.com/in/deicool
> Skype: thumsupdeicool
> Google talk: deicool
> Blog: http://loveandfearless.wordpress.com
> Facebook: http://www.facebook.com/deicool
>
> "Contribute to the world, environment and more :
> http://www.gridrepublic.org
> "
>


Re: OutOfMemory when doing joins in spark 2.0 while same code runs fine in spark 1.5.2

2016-06-09 Thread Ted Yu
bq. Read data from hbase using custom DefaultSource (implemented using
TableScan)

Did you use the DefaultSource from hbase-spark module in hbase master
branch ?
If you wrote your own, mind sharing related code ?
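
In the meantime, one workaround worth trying (a hedged sketch assuming a SparkSession
named spark; not a root-cause fix): the WARN comes from a broadcast exchange, so
disabling automatic broadcast joins forces a shuffle-based sort-merge join instead of
broadcasting the HBase-backed side:

// Disable auto broadcast so neither side of the join is broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
val df3 = df1.join(df2, df1("field1") === df2("field2"), "inner")
df3.count()

If that gets past the OOM, the size statistics your custom relation reports
(BaseRelation.sizeInBytes) would be the next thing to look at.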

Thanks

On Thu, Jun 9, 2016 at 2:53 AM, raaggarw  wrote:

> Hi,
>
> I was trying to port my code from Spark 1.5.2 to Spark 2.0; however, I faced
> some OutOfMemory issues. On drilling down I could see that the OOM is caused
> by a join, because removing the join fixes the issue. I then created a small
> spark-app to reproduce this:
>
> (48 cores, 300gb ram - divided among 4 workers)
>
> line1: df1 = Read a set of parquet files into a dataframe
> line2: df1.count
> line3: df2 = Read data from hbase using custom DefaultSource (implemented
> using TableScan)
> line4: df2.count
> line5: df3 = df1.join(df2, df1("field1") === df2("field2"), "inner")
> line6: df3.count -> *this is where it fails in Spark 2.0 and runs fine in
> spark 1.5.2*
>
> Problem:
> First, a lot of WARN messages:
> 2016-06-09 08:14:18,884 WARN  [broadcast-exchange-0]
> memory.TaskMemoryManager (TaskMemoryManager.java:allocatePage(264)) -
> Failed
> to allocate a page (1048576 bytes), try again.
> And then OOM
>
> I then tried to dump the data fetched from HBase into S3 and created df2
> from S3 rather than HBase; then it worked fine in Spark 2.0 as well.
>
> Could someone please guide me through next steps?
>
> Thanks
> Ravi
> Computer Scientist @ Adobe
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemory-when-doing-joins-in-spark-2-0-while-same-code-runs-fine-in-spark-1-5-2-tp27124.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Write Ahead Log

2016-06-08 Thread Ted Yu
There was a minor typo in the name of the config:

spark.streaming.receiver.writeAheadLog.enable

Yes, it only applies to Streaming.
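
For the streaming case, a minimal sketch of enabling it (hypothetical app name and
checkpoint path; a checkpoint directory is needed so the WAL has somewhere durable
to live):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///tmp/wal-checkpoints")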

On Wed, Jun 8, 2016 at 3:14 PM, Mohit Anchlia 
wrote:

> Is something similar to park.streaming.receiver.writeAheadLog.enable
> available on SparkContext? It looks like it only works for spark streaming.
>


Re: comparaing row in pyspark data frame

2016-06-08 Thread Ted Yu
Do you mean returning col3 and 0.4 for the example row below ?
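
If so, one way to express it (a sketch in Scala; the PySpark DataFrame API is
analogous, and df stands for your dataframe):

import org.apache.spark.sql.functions.{col, greatest, lit, when}

val cols = Seq("col1", "col2", "col3")
// Row-wise maximum across the listed columns.
val rowMax = greatest(cols.map(col): _*)
// Chain of when(...) clauses returning the name of the column holding that max.
val maxColName = cols.tail.foldLeft(when(col(cols.head) === rowMax, lit(cols.head))) {
  (acc, c) => acc.when(col(c) === rowMax, lit(c))
}
val result = df.withColumn("row_max", rowMax).withColumn("max_col", maxColName)
result.show()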

> On Jun 8, 2016, at 5:05 AM, pseudo oduesp  wrote:
> 
> Hi,
> how can we compare multiple columns in a dataframe? I mean,
> 
> if df is a dataframe like this:
> 
>df.col1 | df.col2 | df.col3
>0.2 | 0.3 | 0.4
> 
> how can we compare columns to get the max of each row (not of each column), and get the name of
> the column where that max is present?
> 
> thanks

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Apache design patterns

2016-06-07 Thread Ted Yu
I think this is the correct forum. 

Please describe your case. 

> On Jun 7, 2016, at 8:33 PM, Francois Le Roux  wrote:
> 
> Hi folks, I have been working through the available online Apache Spark
> tutorials and I am stuck with a scenario that I would like to solve in Spark.
> Is this a forum where I can publish a narrative for the problem / scenario
> that I am trying to solve?
> 
> 
> 
> any assistance appreciated
> 
> 
> 
> thanks
> 
> frank

