Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Takeshi Yamamuro
Hi, Mich

Did you check the URL Josh referred to?
The cast for string comparisons is needed so that expressions like `c_date >= "2016"` are accepted.
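
For reference, a quick way to see which side of the comparison picks up the cast
is to print the plan; a sketch against the thread's `events` table (Spark 1.6-era
API assumed, behaviour may differ in other versions):

// With a plain string literal, the analyzer casts the timestamp column to string,
// which is exactly the Filter shown later in this thread:
sqlContext.sql("SELECT * FROM events WHERE registration >= '2015-05-28'").explain(true)

// Casting the literal instead keeps the column side cast-free; depending on the
// data source and Spark version, this may let the range filter be pushed down:
sqlContext.sql(
  "SELECT * FROM events WHERE registration >= CAST('2015-05-28' AS TIMESTAMP)").explain(true)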

// maropu


On Fri, Apr 15, 2016 at 10:30 AM, Hyukjin Kwon  wrote:

> Hi,
>
>
> String comparison itself is pushed down fine, but the problem is dealing
> with the Cast.
>
>
> It was pushed down before but it was reverted (
> https://github.com/apache/spark/pull/8049).
>
> Several fixes were tried, e.g. https://github.com/apache/spark/pull/11005,
> but none of them made it in.
>
>
> To cut it short, it is not being pushed down because it is unsafe to
> resolve the cast (e.g. long to integer).
>
> As a workaround, the implementation of the Solr data source could be
> changed to one based on CatalystScan, which takes all the filters.
>
> CatalystScan is not designed to be binary compatible across releases;
> however, it looks like some consider it stable now, as mentioned here:
> https://github.com/apache/spark/pull/10750#issuecomment-175400704.
>
>
> Thanks!
>
>
> 2016-04-15 3:30 GMT+09:00 Mich Talebzadeh :
>
>> Hi Josh,
>>
>> Can you please clarify whether date comparisons as two strings work at
>> all?
>>
>> I was under the impression that with string comparison only the first
>> characters are compared?
>>
>> Thanks
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 April 2016 at 19:26, Josh Rosen  wrote:
>>
>>> AFAIK this is not being pushed down because it involves an implicit cast
>>> and we currently don't push casts into data sources or scans; see
>>> https://github.com/databricks/spark-redshift/issues/155 for a
>>> possibly-related discussion.
>>>
>>> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
 Are you comparing strings in here or timestamp?

 Filter ((cast(registration#37 as string) >= 2015-05-28) &&
 (cast(registration#37 as string) <= 2015-05-29))


 Dr Mich Talebzadeh



 LinkedIn * 
 https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
 *



 http://talebzadehmich.wordpress.com



 On 14 April 2016 at 18:04, Kiran Chitturi <
 kiran.chitt...@lucidworks.com> wrote:

> Hi,
>
> Timestamp range filter queries in SQL are not getting pushed down to
> the PrunedFilteredScan instances. The filtering is happening at the Spark
> layer.
>
> The physical plan for timestamp range queries does not show the pushed
> filters, whereas range queries on other types work fine and the physical
> plan shows the pushed filters.
>
> Please see below for code and examples.
>
> *Example:*
>
> *1.* Range filter queries on Timestamp types
>
>*code: *
>
>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>
>*Full example*:
> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
> *plan*:
> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>
> *2. * Range filter queries on Long types
>
> *code*:
>
>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>> `length` <= '1000'")
>
> *Full example*:
> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
> *plan*:
> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>
> The SolrRelation class we use extends the PrunedFilteredScan.
>
> Since Solr supports date ranges, I would like for the timestamp
> filters to be pushed down to the Solr query.
>
> Are there limitations on the type of filters that are passed down with
> Timestamp types ?
> Is there something that I should do in my code to fix this ?
>
> Thanks,
> --
> Kiran Chitturi
>
>

>>
>


-- 
---
Takeshi Yamamuro


Re: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Takeshi Yamamuro
Hi,

It seems you cannot use reserved words (e.g., sum and avg) as part of plain
column names there, because the input string passed to filter is processed by
the Spark SQL parser internally.

// maropu
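
For example, following Michael's suggestion quoted below (column name taken
from the original report; a sketch, assuming a DataFrame `df` as in that message):

// Backticks make the parser treat the whole name, parentheses included,
// as a single column reference instead of a call to sum():
df.filter("`sum(OpenAccounts)` > 5").show()

// The Column-based API avoids string parsing altogether:
df.filter(org.apache.spark.sql.functions.col("sum(OpenAccounts)") > 5).show()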

On Thu, Apr 14, 2016 at 11:14 PM,  wrote:

> Appreciated, Michael, but this doesn’t help my case: the filter string is
> being submitted from outside my program. Is there any other alternative?
> Some literal string parser or anything I can do beforehand?
>
>
>
> Saif
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Wednesday, April 13, 2016 6:29 PM
> *To:* Ellafi, Saif A.
> *Cc:* user
> *Subject:* Re: Strange bug: Filter problem with parenthesis
>
>
>
> You need to use `backticks` to reference columns that have non-standard
> characters.
>
>
>
> On Wed, Apr 13, 2016 at 6:56 AM,  wrote:
>
> Hi,
>
>
>
> I am debugging a program, and for some reason, a line calling the
> following is failing:
>
>
>
> df.filter("sum(OpenAccounts) > 5").show
>
>
>
> It says it cannot find the column *OpenAccounts*, as if it were applying
> the sum() function and looking for a column with that name, which does not
> exist. This works fine if I rename the column to something without
> parentheses.
>
>
>
> I can’t reproduce this issue in the Spark shell (1.6.0); any ideas on how I
> can analyze this? This is an aggregation result, with the default column
> names afterwards.
>
>
>
> PS: A workaround is to use toDF(cols) and rename all columns, but I am
> wondering if toDF has any impact on the underlying RDD structure (e.g.
> repartitioning, cache, etc.).
>
>
>
> Appreciated,
>
> Saif
>
>
>
>
>



-- 
---
Takeshi Yamamuro


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Takeshi Yamamuro
Hi,

How about checking the Spark Survey 2015 results at
https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html
for statistics?

// maropu

On Fri, Apr 15, 2016 at 4:52 AM, Mark Hamstra 
wrote:

> That's also available in standalone.
>
> On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
>> Spark on Yarn supports dynamic resource allocation
>>
>> So, you can run several spark-shells / spark-submits / spark-jobserver /
>> zeppelin on one cluster without defining upfront how many executors /
>> memory you want to allocate to each app
>>
>> Great feature for regular users who just want to run Spark / Spark SQL
>>
>>
>> On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen  wrote:
>>
>>> I don't think usage is the differentiating factor. YARN and standalone
>>> are pretty well supported. If you are only running a Spark cluster by
>>> itself with nothing else, standalone is probably simpler than setting
>>> up YARN just for Spark. However if you're running on a cluster that
>>> will host other applications, you'll need to integrate with a shared
>>> resource manager and its security model, and for anything
>>> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>>>
>>> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>>>  wrote:
>>> > AWS EMR includes Spark on Yarn
>>> > Hortonworks and Cloudera platforms include Spark on Yarn as well
>>> >
>>> >
>>> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
>>> arkadiusz.b...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>>> >> production ?
>>> >>
>>> >> I would like to choose most supported and used technology in
>>> >> production for our project.
>>> >>
>>> >>
>>> >> BR,
>>> >>
>>> >> Arkadiusz Bicz
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>>
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Takeshi Yamamuro
Hi,

See SPARK-2883 for ORC support.

// maropu

On Fri, Apr 15, 2016 at 11:22 AM, Ted Yu  wrote:

> For Parquet, please take a look at SPARK-1251
>
> For ORC, not sure.
> Looking at git history, I found ORC mentioned by SPARK-1368
>
> FYI
>
> On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli  wrote:
>
>> I am needing this fact for the research paper I am writing right now.
>>
>> When did Spark start supporting Parquet and when ORC?
>> (what release)
>>
>> I appreciate any info you can offer.
>>
>> Thank you,
>> Edmon
>>
>
>


-- 
---
Takeshi Yamamuro


Re: When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Ted Yu
For Parquet, please take a look at SPARK-1251

For ORC, not sure.
Looking at git history, I found ORC mentioned by SPARK-1368

FYI

On Thu, Apr 14, 2016 at 6:53 PM, Edmon Begoli  wrote:

> I am needing this fact for the research paper I am writing right now.
>
> When did Spark start supporting Parquet and when ORC?
> (what release)
>
> I appreciate any info you can offer.
>
> Thank you,
> Edmon
>


decline offer timeout

2016-04-14 Thread Rodrick Brown
  
I have hundreds of small Spark jobs running on my Mesos cluster, causing
starvation to other frameworks on the cluster, like Marathon.

  

Is there a way to prevent these frameworks from getting offers so often?

  

Apr 15 02:00:12 prod-mesos-m-3.$SERVER.com mesos-master[10259]: I0415
02:00:12.503734 10266 master.cpp:3641] Processing DECLINE call for offers: [
50ceafa4-f3c1-4738-a9eb-c5d3bf0ff742-O7112667 ] for framework
50ceafa4-f3c1-4738-a9eb-c5d3bf0ff742-15936 (KafkaDirectConsumer[trades-topic])
at scheduler-9e557d33-e4a4-44ce-9dbe-0a7ca7c4842d@172.x.x.x:34858.  

  

  
  
  
--

**Rodrick Brown** / Systems Engineer 

+1 917 445 6839 /
[rodr...@orchardplatform.com](mailto:char...@orchardplatform.com)

**Orchard Platform** 

101 5th Avenue, 4th Floor, New York, NY 10003

[http://www.orchardplatform.com](http://www.orchardplatform.com/)

[Orchard Blog](http://www.orchardplatform.com/blog/) | [Marketplace Lending
Meetup](http://www.meetup.com/Peer-to-Peer-Lending-P2P/)




When did Spark started supporting ORC and Parquet?

2016-04-14 Thread Edmon Begoli
I need this fact for a research paper I am writing right now.

When did Spark start supporting Parquet and when ORC?
(what release)

I appreciate any info you can offer.

Thank you,
Edmon


Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Hyukjin Kwon
Hi,


String comparison itself is pushed down fine, but the problem is dealing
with the Cast.


It was pushed down before but it was reverted
(https://github.com/apache/spark/pull/8049).

Several fixes were tried, e.g. https://github.com/apache/spark/pull/11005,
but none of them made it in.


To cut it short, it is not being pushed down because it is unsafe to
resolve the cast (e.g. long to integer).

As a workaround, the implementation of the Solr data source could be
changed to one based on CatalystScan, which takes all the filters.

CatalystScan is not designed to be binary compatible across releases;
however, it looks like some consider it stable now, as mentioned here:
https://github.com/apache/spark/pull/10750#issuecomment-175400704.
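
A minimal sketch of what a CatalystScan-based relation looks like (hypothetical
class, not the actual spark-solr code; Spark 1.x sources API assumed). Unlike
PrunedFilteredScan, which receives already-translated Array[Filter] values,
CatalystScan hands the relation the raw Catalyst expressions, casts included:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.catalyst.expressions.{Attribute, Expression}
import org.apache.spark.sql.sources.{BaseRelation, CatalystScan}
import org.apache.spark.sql.types.StructType

class SolrCatalystRelation(override val sqlContext: SQLContext,
                           override val schema: StructType)
  extends BaseRelation with CatalystScan {

  override def buildScan(requiredColumns: Seq[Attribute],
                         filters: Seq[Expression]): RDD[Row] = {
    // The timestamp range arrives here as full expressions (e.g. a
    // GreaterThanOrEqual over a Cast of the registration column), so the
    // relation itself can decide how to turn it into a Solr date-range query.
    filters.foreach(f => println(s"catalyst filter: $f"))
    sqlContext.sparkContext.emptyRDD[Row]  // placeholder scan
  }
}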


Thanks!


2016-04-15 3:30 GMT+09:00 Mich Talebzadeh :

> Hi Josh,
>
> Can you please clarify whether date comparisons as two strings work at all?
>
> I was under the impression that with string comparison only the first
> characters are compared?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 April 2016 at 19:26, Josh Rosen  wrote:
>
>> AFAIK this is not being pushed down because it involves an implicit cast
>> and we currently don't push casts into data sources or scans; see
>> https://github.com/databricks/spark-redshift/issues/155 for a
>> possibly-related discussion.
>>
>> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are you comparing strings in here or timestamp?
>>>
>>> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
>>> (cast(registration#37 as string) <= 2015-05-29))
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 April 2016 at 18:04, Kiran Chitturi >> > wrote:
>>>
 Hi,

 Timestamp range filter queries in SQL are not getting pushed down to
 the PrunedFilteredScan instances. The filtering is happening at the Spark
 layer.

 The physical plan for timestamp range queries does not show the pushed
 filters, whereas range queries on other types work fine and the physical
 plan shows the pushed filters.

 Please see below for code and examples.

 *Example:*

 *1.* Range filter queries on Timestamp types

*code: *

> sqlContext.sql("SELECT * from events WHERE `registration` >=
> '2015-05-28' AND `registration` <= '2015-05-29' ")

*Full example*:
 https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
 *plan*:
 https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql

 *2. * Range filter queries on Long types

 *code*:

> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
> `length` <= '1000'")

 *Full example*:
 https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
 *plan*:
 https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql

 The SolrRelation class we use extends the PrunedFilteredScan.

 Since Solr supports date ranges, I would like for the timestamp filters
 to be pushed down to the Solr query.

 Are there limitations on the type of filters that are passed down with
 Timestamp types ?
 Is there something that I should do in my code to fix this ?

 Thanks,
 --
 Kiran Chitturi


>>>
>


Re: spark-ec2 hitting yum install issues

2016-04-14 Thread Nicholas Chammas
If you log into the cluster and manually try that step does it still fail?
Can you yum install anything else?

You might want to report this issue directly on the spark-ec2 repo, btw:
https://github.com/amplab/spark-ec2

Nick

On Thu, Apr 14, 2016 at 9:08 PM sanusha  wrote:

>
> I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the
> spark-ec2 to launch a cluster in
> Amazon VPC. The setup.sh script [run first thing on master after launch]
> uses pssh and tries to install it
> via 'yum install -y pssh'. This step always fails on the master AMI that
> the
> script uses by default as it is
> not able to find it in the repo mirrors - hits 403.
>
> Has anyone faced this and know what's causing it? For now, I have changed
> the script to not use pssh
> as a workaround. But would like to fix the root cause.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-hitting-yum-install-issues-tp26786.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


spark-ec2 hitting yum install issues

2016-04-14 Thread sanusha

I am using spark-1.6.1-prebuilt-with-hadoop-2.6 on mac. I am using the
spark-ec2 to launch a cluster in
Amazon VPC. The setup.sh script [run first thing on master after launch]
uses pssh and tries to install it 
via 'yum install -y pssh'. This step always fails on the master AMI that the
script uses by default, as it is not able to find the package in the repo
mirrors (hits 403).

Has anyone faced this, and does anyone know what's causing it? For now, I have
changed the script not to use pssh as a workaround, but I would like to fix the
root cause.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-hitting-yum-install-issues-tp26786.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark/Parquet

2016-04-14 Thread Hyukjin Kwon
Currently Spark uses Parquet 1.7.0 (parquet-mr).

If you meant writer version 2 (parquet-format), you can specify it by
manually setting it as below:

// imports assumed for the parquet-mr 1.7.x classes referenced here
import org.apache.parquet.hadoop.ParquetOutputFormat
import org.apache.parquet.column.ParquetProperties

sparkContext.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)


2016-04-15 2:21 GMT+09:00 Younes Naguib :

> Hi all,
>
>
>
> When parquet 2.0 planned in Spark?
>
> Or is it already?
>
>
>
>
>
> *Younes Naguib*
>
> Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
>
> Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.naguib
> @tritondigital.com 
>
>
>


Re: how to write pyspark interface to scala code?

2016-04-14 Thread Holden Karau
It's a bit tricky. If the user's data is represented in a DataFrame or
Dataset then it's much easier. Assuming that the function is going to be
called from the driver program (i.e. not inside of a transformation or
action), you can use the Py4J context to make the calls. You might find it
helpful to look at wrapper.py in the ml directory.
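
As a rough sketch (hypothetical names, not from the Spark codebase), the Scala
side of such a wrapper can simply expose an object whose methods take and
return DataFrames; from PySpark the object is reachable through the Py4J
gateway (sc._jvm.com.example.MatrixFactorization) and the returned JVM
DataFrame can be re-wrapped on the Python side, which is essentially what the
ml wrappers do:

package com.example

import org.apache.spark.sql.DataFrame

object MatrixFactorization {
  // Exchanging DataFrames keeps the Py4J boundary simple: only a reference
  // to the JVM-side DataFrame crosses between Python and Scala.
  def factorize(input: DataFrame, rank: Int): DataFrame = {
    // ... the actual factorization would go here; identity as a placeholder ...
    input
  }
}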

On Tue, Apr 12, 2016 at 4:30 PM, AlexG  wrote:

> I have Scala Spark code for computing a matrix factorization. I'd like to
> make it possible to use this code from PySpark, so users can pass in a
> python RDD and receive back one without knowing or caring that Scala code
> is
> being called.
>
> Please point me to an example of code (e.g. somewhere in the Spark
> codebase,
> if it's clean enough) from which I can learn how to do this.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/how-to-write-pyspark-interface-to-scala-code-tp26765.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: error "Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe."

2016-04-14 Thread Holden Karau
The org.apache.spark.sql.execution.EvaluatePython.takeAndServe exception
can happen in a lot of places; it might be easier to figure out if you can
share a code snippet showing where this is occurring.

On Wed, Apr 13, 2016 at 2:27 PM, AlexModestov 
wrote:

> I get this error.
> Who knows what does it mean?
>
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Exception while getting task result:
> org.apache.spark.storage.BlockFetchException: Failed to fetch block from 1
> locations. Most recent failure cause:
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
> at scala.Option.foreach(Option.scala:236)
> at
>
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
> at
>
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
> at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1025)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.reduce(RDD.scala:1007)
> at
> org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1397)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
> at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1384)
> at
>
> org.apache.spark.sql.execution.TakeOrderedAndProject.collectData(basicOperators.scala:213)
> at
>
> org.apache.spark.sql.execution.TakeOrderedAndProject.doExecute(basicOperators.scala:223)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
>
> org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
> at
>
> org.apache.spark.sql.execution.Union$$anonfun$doExecute$1.apply(basicOperators.scala:144)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at
>
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at
> org.apache.spark.sql.execution.Union.doExecute(basicOperators.scala:144)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at
>
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at
>
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at
> org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:187)
> at
>
> 

Re: JSON Usage

2016-04-14 Thread Holden Karau
You could certainly use RDDs for that, but you might also find it easier to
use a DataFrame/Dataset, selecting the fields you need to construct the URL to
fetch and then using the map function.
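
A sketch of that approach (Spark 1.x API; the S3 path is hypothetical and the
schema is assumed to match the JSON quoted below):

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, explode}

// assumes an existing SQLContext
val sqlContext: SQLContext = ???
val events = sqlContext.read.json("s3n://some-bucket/event-notifications/")

// Flatten "Records" and keep only the fields needed to locate each object.
val locations = events
  .select(explode(col("Records")).as("r"))
  .select(col("r.s3.bucket.name").as("bucket"), col("r.s3.object.key").as("key"))
  .map(row => s"s3://${row.getString(0)}/${row.getString(1)}")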

On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim  wrote:

> I was wondering what would be the best way to use JSON in Spark/Scala. I need
> to lookup values of fields in a collection of records to form a URL and
> download that file at that location. I was thinking an RDD would be perfect
> for this. I just want to hear from others who might have more experience in
> this. Below is the actual JSON structure that I am trying to use for the S3
> bucket and key values of each “record" within “Records".
>
> {
>"Records":[
>   {
>  "eventVersion":"2.0",
>  "eventSource":"aws:s3",
>  "awsRegion":"us-east-1",
>  "eventTime":The time, in ISO-8601 format, for example,
> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>  "eventName":"event-type",
>  "userIdentity":{
>
> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>  },
>  "requestParameters":{
> "sourceIPAddress":"ip-address-where-request-came-from"
>  },
>  "responseElements":{
> "x-amz-request-id":"Amazon S3 generated request ID",
> "x-amz-id-2":"Amazon S3 host that processed the request"
>  },
>  "s3":{
> "s3SchemaVersion":"1.0",
> "configurationId":"ID found in the bucket notification
> configuration",
> "bucket":{
>"name":"bucket-name",
>"ownerIdentity":{
>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>},
>"arn":"bucket-ARN"
> },
> "object":{
>"key":"object-key",
>"size":object-size,
>"eTag":"object eTag",
>"versionId":"object version if bucket is
> versioning-enabled, otherwise null",
>"sequencer": "a string representation of a hexadecimal
> value used to determine event sequence,
>only used with PUTs and DELETEs"
> }
>  }
>   },
>   {
>   // Additional events
>   }
>]
> }
>
> Thanks
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: Error with --files

2016-04-14 Thread Benjamin Zaitlen
That fixed it!

Thank you!

--Ben

On Thu, Apr 14, 2016 at 5:53 PM, Marcelo Vanzin  wrote:

> On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen 
> wrote:
> >> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
> >> /home/ubuntu/localtest.txt#appSees.txt
>
> --files should come before the path to your python script. Otherwise
> it's just passed as arguments to your script when it's run.
>
> --
> Marcelo
>


Re: How does spark-submit handle Python scripts (and how to repeat it)?

2016-04-14 Thread Andrei
Yes, I tried setting YARN_CONF_DIR, but with no luck. I will play around
with environment variables and system properties and post back in case of
success. Thanks for your help so far!

On Thu, Apr 14, 2016 at 5:48 AM, Sun, Rui  wrote:

> In SparkSubmit, there is less work for yarn-client than for yarn-cluster.
> Basically it prepares some Spark configurations as system properties, for
> example, information on additional resources required by the application
> that need to be distributed to the cluster. These configurations will be
> used in SparkContext initialization later.
>
>
>
> So generally for yarn-client, maybe you can skip spark-submit and directly
> launch the Spark application, with some configuration set up before creating
> the new SparkContext.
>
>
>
> Not sure about your error, have you setup YARN_CONF_DIR?
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Thursday, April 14, 2016 5:45 AM
>
> *To:* Sun, Rui 
> *Cc:* user 
> *Subject:* Re: How does spark-submit handle Python scripts (and how to
> repeat it)?
>
>
>
> Julia can pick the env var, and set the system properties or directly fill
> the configurations into a SparkConf, and then create a SparkContext
>
>
>
> That's the point - just setting master to "yarn-client" doesn't work, even
> in Java/Scala. E.g. following code in *Scala*:
>
>
> val conf = new SparkConf().setAppName("My App").setMaster("yarn-client")
> val sc = new SparkContext(conf)
> sc.parallelize(1 to 10).collect()
> sc.stop()
>
>
>
> results in an error:
>
>
>
> Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
>
>
> I think for now we can even put Julia aside and concentrate on the following
> question: how does submitting an application via `spark-submit` in
> "yarn-client" mode differ from setting the same mode directly in
> `SparkConf`?
>
>
>
>
>
>
>
> On Wed, Apr 13, 2016 at 5:06 AM, Sun, Rui  wrote:
>
> Spark configurations specified at the command line for spark-submit should
> be passed to the JVM inside Julia process. You can refer to
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L267
> and
> https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L295
>
> Generally,
>
> spark-submit JVM -> JuliaRunner -> Env var like
> “JULIA_SUBMIT_ARGS” -> julia process -> new JVM with SparkContext
>
>   Julia can pick the env var, and set the system properties or directly
> fill the configurations into a SparkConf, and then create a SparkContext
>
>
>
> Yes, you are right, `spark-submit` creates new Python/R process that
> connects back to that same JVM and creates SparkContext in it.
>
> Refer to
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L47
> and
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/RRunner.scala#L65
>
>
>
>
>
> *From:* Andrei [mailto:faithlessfri...@gmail.com]
> *Sent:* Wednesday, April 13, 2016 4:32 AM
> *To:* Sun, Rui 
> *Cc:* user 
> *Subject:* Re: How does spark-submit handle Python scripts (and how to
> repeat it)?
>
>
>
> One part is passing the command line options, like “--master”, from the
> JVM launched by spark-submit to the JVM where SparkContext resides
>
>
>
> Since I have full control over both - JVM and Julia parts - I can pass
> whatever options to both. But what exactly should be passed? Currently
> pipeline looks like this:
>
>
>
> spark-submit JVM -> JuliaRunner -> julia process -> new JVM with
> SparkContext
>
>
>
>  I want to make the last JVM's SparkContext to understand that it should
> run on YARN. Obviously, I can't pass `--master yarn` option to JVM itself.
> Instead, I can pass system property "spark.master" = "yarn-client", but
> this results in an error:
>
>
>
> Retrying connect to server: 0.0.0.0/0.0.0.0:8032
>
>
>
>
>
> So it's definitely not enough. I tried to manually set all the system
> properties that `spark-submit` adds to the JVM (including
> "spark-submit=true", "spark.submit.deployMode=client", etc.), but that didn't
> help either. Source code is always good, but for a stranger like me it's a
> little bit hard to grasp the control flow in the SparkSubmit class.
>
>
>
>
>
> For pySpark & SparkR, when running scripts in client deployment modes
> (standalone client and yarn client), the JVM is the same (py4j/RBackend
> running as a thread in the JVM launched by spark-submit)
>
>
>
> Can you elaborate on this? Does it mean that `spark-submit` creates new
> Python/R process that connects back to that same JVM and creates
> SparkContext in it?
>
>
>
>
>
> On Tue, Apr 12, 2016 at 2:04 PM, Sun, Rui  wrote:
>
> There is much deployment preparation work handling different deployment
> modes for pyspark and SparkR in 

Re: Error with --files

2016-04-14 Thread Marcelo Vanzin
On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen  wrote:
>> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
>> /home/ubuntu/localtest.txt#appSees.txt

--files should come before the path to your python script. Otherwise
it's just passed as arguments to your script when it's run.
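
That is, reordering the command from the original message:

spark-submit --master yarn-cluster \
  --files /home/ubuntu/localtest.txt#appSees.txt \
  /home/ubuntu/test_spark.py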

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Error with --files

2016-04-14 Thread Ted Yu
bq. localtest.txt#appSees.txt

Which file did you want to pass ?

Thanks

On Thu, Apr 14, 2016 at 2:14 PM, Benjamin Zaitlen 
wrote:

> Hi All,
>
> I'm trying to use the --files option with yarn:
>
> spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
>> /home/ubuntu/localtest.txt#appSees.txt
>
>
> I never see the file in HDFS or in the yarn containers.  Am I doing
> something incorrect ?
>
> I'm running spark 1.6.0
>
>
> Thanks,
> --Ben
>


Adding metadata information to parquet files

2016-04-14 Thread Manivannan Selvadurai
Hi All,

 I'm trying to ingest data from Kafka as parquet files. I use Spark 1.5.2
and I'm looking for a way to store the source schema in the parquet file,
the way you can store the Avro schema as metadata when using the
AvroParquetWriter. Any help much appreciated.


Can this performance be improved?

2016-04-14 Thread Bibudh Lahiri
Hi,
As part of a larger program, I am extracting the distinct values of
some columns of an RDD with 100 million records and 4 columns. I am running
Spark in standalone cluster mode on my laptop (2.3 GHz Intel Core i7, 10 GB
1333 MHz DDR3 RAM) with all the 8 cores given to a single worker. So my
statement is something like this:

age_groups = patients_rdd.map(lambda x:x.split(",")).map(lambda x:
x[1]).distinct()

   It is taking about 3.8 minutes. It is spawning 89 tasks when dealing
with this RDD because (I guess) the block size is 32 MB, and the entire
file is 2.8 GB, so there are 2.8*1024/32 = 89 blocks. The ~4 minute time
means it is processing about 50k records per second per core/task.

Does this performance look typical or is there room for improvement?

Thanks
Bibudh



-- 
Bibudh Lahiri
Data Scientist, Impetus Technologies
5300 Stevens Creek Blvd
San Jose, CA 95129
http://knowthynumbers.blogspot.com/


Re: Spark replacing Hadoop

2016-04-14 Thread Mich Talebzadeh
One can see from the responses that Big Data landscape is getting very
crowded with tools and there are dozens of alternatives offered. However,
as usual the laws of selection will gravitate towards solutions that are
scalable, reliable and more importantly cost effective.

To this end any commercial decision to acquire solutions as a technology
stack has to take into account the available skill sets in-house and the
stability of the products. I would concur with those that agree that a
smart solution will always require a good query engine, a mechanism to
organise the storage and the storage layer itself plus the resource
manager. The rests are icing on the cake.

To me Spark with Hive, HDFS and YARN is a winning combination. Hadoop
encompasses HDFS and it is almost impossible to sidestep it without finding a
viable alternative for persistent storage. I also take the point that with
investments already made in Hadoop, the exit barriers mean a replacement won't
make commercial sense. In other words, one needs compelling arguments (beyond
a purely technical outlook) to replace Hadoop in this financial climate, where
technology dollars are at a premium.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 21:35, Peyman Mohajerian  wrote:

> Cloud adds another dimension:
> The fact that in the cloud compute and storage are decoupled (S3/EMR or
> Blob/HDInsight) means that in the cloud Hadoop ends up being more of a
> compute engine, and a lot of the governance and security features are
> irrelevant or less important because data at rest is outside Hadoop.
> Currently the biggest reason to run Spark in Hadoop is YARN (in the cloud),
> but if you decide to use Mesos/standalone then again you may not need
> Hadoop. Databricks adds another dimension to this in the cloud, which I
> won't comment on.
>
> But on-premise I think you can argue that HDFS is here to stay in many
> forms, e.g. Isilon, object stores and other storage types not just local
> disk. HDFS API actually works over Azure's Data Lake Store completely
> independent of Hadoop!
>
> On Thu, Apr 14, 2016 at 1:29 PM, Cody Koeninger 
> wrote:
>
>> I've been using spark for years and have (thankfully) been able to
>> avoid needing HDFS, aside from one contract where it was already in
>> use.
>>
>> At this point, many of the people I know would consider Kafka to be
>> more important than HDFS.
>>
>> On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke 
>> wrote:
>> > I do not think so. Hadoop provides an ecosystem in which you can deploy
>> > different engines, such as MR, HBase, TEZ, Spark, Flink, titandb, hive,
>> > solr... I observe also that commercial analytical tools use one or more
>> of
>> > these engines to execute their code in a distributed fashion. You  need
>> this
>> > flexibility to have an ecosystem suitable for your needs -especially In
>> the
>> > area of security. HDFS is one key element for the storage and locality.
>> > Spark itself cannot provide such a complete ecosystem but is part of
>> > ecosystems.
>> >
>> > On 14 Apr 2016, at 21:13, Ashok Kumar 
>> wrote:
>> >
>> > Hi,
>> >
>> > I hear that some saying that Hadoop is getting old and out of date and
>> will
>> > be replaced by Spark!
>> >
>> > Does this make sense and if so how accurate is it?
>> >
>> > Best
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Error with --files

2016-04-14 Thread Benjamin Zaitlen
Hi All,

I'm trying to use the --files option with yarn:

spark-submit --master yarn-cluster /home/ubuntu/test_spark.py --files
> /home/ubuntu/localtest.txt#appSees.txt


I never see the file in HDFS or in the yarn containers.  Am I doing
something incorrect ?

I'm running spark 1.6.0


Thanks,
--Ben


Re: Spark replacing Hadoop

2016-04-14 Thread Peyman Mohajerian
Cloud adds another dimension:
The fact that in the cloud compute and storage are decoupled (S3/EMR or
Blob/HDInsight) means that in the cloud Hadoop ends up being more of a compute
engine, and a lot of the governance and security features are irrelevant or
less important because data at rest is outside Hadoop.
Currently the biggest reason to run Spark in Hadoop is YARN (in the cloud),
but if you decide to use Mesos/standalone then again you may not need Hadoop.
Databricks adds another dimension to this in the cloud, which I won't comment
on.

But on-premise I think you can argue that HDFS is here to stay in many forms,
e.g. Isilon, object stores and other storage types, not just local disk. The
HDFS API actually works over Azure's Data Lake Store completely independently
of Hadoop!

On Thu, Apr 14, 2016 at 1:29 PM, Cody Koeninger  wrote:

> I've been using spark for years and have (thankfully) been able to
> avoid needing HDFS, aside from one contract where it was already in
> use.
>
> At this point, many of the people I know would consider Kafka to be
> more important than HDFS.
>
> On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke  wrote:
> > I do not think so. Hadoop provides an ecosystem in which you can deploy
> > different engines, such as MR, HBase, TEZ, Spark, Flink, titandb, hive,
> > solr... I observe also that commercial analytical tools use one or more
> of
> > these engines to execute their code in a distributed fashion. You  need
> this
> > flexibility to have an ecosystem suitable for your needs -especially In
> the
> > area of security. HDFS is one key element for the storage and locality.
> > Spark itself cannot provide such a complete ecosystem but is part of
> > ecosystems.
> >
> > On 14 Apr 2016, at 21:13, Ashok Kumar 
> wrote:
> >
> > Hi,
> >
> > I hear that some saying that Hadoop is getting old and out of date and
> will
> > be replaced by Spark!
> >
> > Does this make sense and if so how accurate is it?
> >
> > Best
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark replacing Hadoop

2016-04-14 Thread Cody Koeninger
I've been using spark for years and have (thankfully) been able to
avoid needing HDFS, aside from one contract where it was already in
use.

At this point, many of the people I know would consider Kafka to be
more important than HDFS.

On Thu, Apr 14, 2016 at 3:11 PM, Jörn Franke  wrote:
> I do not think so. Hadoop provides an ecosystem in which you can deploy
> different engines, such as MR, HBase, TEZ, Spark, Flink, titandb, hive,
> solr... I observe also that commercial analytical tools use one or more of
> these engines to execute their code in a distributed fashion. You  need this
> flexibility to have an ecosystem suitable for your needs -especially In the
> area of security. HDFS is one key element for the storage and locality.
> Spark itself cannot provide such a complete ecosystem but is part of
> ecosystems.
>
> On 14 Apr 2016, at 21:13, Ashok Kumar  wrote:
>
> Hi,
>
> I hear that some saying that Hadoop is getting old and out of date and will
> be replaced by Spark!
>
> Does this make sense and if so how accurate is it?
>
> Best

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark replacing Hadoop

2016-04-14 Thread Jörn Franke
I do not think so. Hadoop provides an ecosystem in which you can deploy
different engines, such as MR, HBase, Tez, Spark, Flink, TitanDB, Hive, Solr...
I also observe that commercial analytical tools use one or more of these
engines to execute their code in a distributed fashion. You need this
flexibility to have an ecosystem suitable for your needs, especially in the
area of security. HDFS is one key element for storage and locality. Spark
itself cannot provide such a complete ecosystem, but is part of ecosystems.

> On 14 Apr 2016, at 21:13, Ashok Kumar  wrote:
> 
> Hi,
> 
> I hear that some saying that Hadoop is getting old and out of date and will 
> be replaced by Spark!
> 
> Does this make sense and if so how accurate is it?
> 
> Best


Re: Spark replacing Hadoop

2016-04-14 Thread Sean Owen
Depends indeed on what you mean by "Hadoop". The core Hadoop project
is MapReduce, YARN and HDFS. MapReduce is still in use as a workhorse
but superseded by engines like Spark (or perhaps Flink).  (Tez maps
loosely to Spark Core really, and is not really a MapReduce
replacement.)

"Hadoop" can also be a catch-all term for projects typically used
together in conjunction with core Hadoop. That can be Spark, Kafka,
HBase, ZK, Solr, Parquet, Impala, Hive, etc.

If you mean the former -- mostly no, Spark needs a storage layer like
HDFS for persistent storage, and needs to integrate with a cluster
manager like YARN in order to share resources with other apps, but
replaces MapReduce.

If you mean the latter -- no, Spark is a big piece of the broader
picture and replaces several pieces (Mahout, maybe Crunch in some
ways, Giraph, arguably takes on some of Hive's workloads), but doesn't
replace most of them.

Really, there's no reason to expect that one project will do
everything. Core Hadoop most certainly wasn't enough to handle all
the "Hadoop" workloads today. It's a false choice. You can use Spark
*and* Hadoop-related projects and that's the best of all.

On Thu, Apr 14, 2016 at 8:40 PM, Mich Talebzadeh
 wrote:
> Hi,
>
> My two cents here.
>
> Hadoop as I understand has two components namely HDFS (Hadoop Distributed
> File System) and MapReduce.
>
> Whatever we use I still think we need to store data on HDFS (excluding
> standalones like MongoDB etc.). Now moving to MapReduce as the execution
> engine that is replaced by TEZ (basically MapReduce with DAG) or with Spark
> which uses in memory capabilities and DAG. MapReduce is the one moving
> sideways.
>
> To me Spark besides being versatile is a powerful tool. Remember tools are
> just tools, not solutions so we can discuss this all day. Effectively I
> would argue that with Spark as the front end tool with Hive and its
> organisation for metadata plus HDFS as the storage layer, you have all three
> components to create a powerful solution.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
>
> On 14 April 2016 at 20:22, Andy Davidson 
> wrote:
>>
>> Hi Ashok
>>
>> In general if I was starting a new project and had not invested heavily in
>> hadoop (i.e. Had a large staff that was trained on hadoop, had a lot of
>> existing projects implemented on hadoop, …) I would probably start using
>> spark. Its faster and easier to use
>>
>> Your mileage may vary
>>
>> Andy
>>
>> From: Ashok Kumar 
>> Reply-To: Ashok Kumar 
>> Date: Thursday, April 14, 2016 at 12:13 PM
>> To: "user @spark" 
>> Subject: Spark replacing Hadoop
>>
>> Hi,
>>
>> I hear that some saying that Hadoop is getting old and out of date and
>> will be replaced by Spark!
>>
>> Does this make sense and if so how accurate is it?
>>
>> Best
>
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone.

On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov 
wrote:

> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executors /
> memory you want to allocate to each app
>
> Great feature for regular users who just want to run Spark / Spark SQL
>
>
> On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen  wrote:
>
>> I don't think usage is the differentiating factor. YARN and standalone
>> are pretty well supported. If you are only running a Spark cluster by
>> itself with nothing else, standalone is probably simpler than setting
>> up YARN just for Spark. However if you're running on a cluster that
>> will host other applications, you'll need to integrate with a shared
>> resource manager and its security model, and for anything
>> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>>
>> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>>  wrote:
>> > AWS EMR includes Spark on Yarn
>> > Hortonworks and Cloudera platforms include Spark on Yarn as well
>> >
>> >
>> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
>> arkadiusz.b...@gmail.com>
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> >> production ?
>> >>
>> >> I would like to choose most supported and used technology in
>> >> production for our project.
>> >>
>> >>
>> >> BR,
>> >>
>> >> Arkadiusz Bicz
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>


Client process memory usage

2016-04-14 Thread Nisrina Luthfiyati
Hi all,
I have a python Spark application that I'm running using spark-submit in
yarn-cluster mode.
If I run ps -aux | grep  in the submitter node, I can
find the client process that submitted the application, usually with around
300-600 MB memory use (%MEM around 1.0-2.0 in a node with 30 GB memory).

Is there anything that I can do to make this smaller? Also, as far as I know,
in yarn-cluster mode the client does nothing after the application is
launched, so what is the memory used for?

Thank you,
Nisrina.


Re: Spark replacing Hadoop

2016-04-14 Thread Arunkumar Chandrasekar
Hello,

I would stand on the side of Spark. Spark provides numerous add-ons, like Spark SQL and
Spark MLlib, that would be hard to set up with MapReduce.

Thank You.



> On Apr 15, 2016, at 1:16 AM, Ashok Kumar  wrote:
> 
> Hello,
> 
> Well, Sounds like Andy is implying that Spark can replace Hadoop whereas Mich 
> still believes that HDFS is a keeper?
> 
> thanks
> 
> 
> 
> 
> On Thursday, 14 April 2016, 20:40, David Newberger 
>  wrote:
> 
> 
> Can we assume your question is “Will Spark replace Hadoop MapReduce?” or do 
> you literally mean replacing the whole of Hadoop?
>  
> David
>  
> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] 
> Sent: Thursday, April 14, 2016 2:13 PM
> To: User
> Subject: Spark replacing Hadoop
>  
> Hi,
>  
> I hear that some saying that Hadoop is getting old and out of date and will 
> be replaced by Spark!
>  
> Does this make sense and if so how accurate is it?
>  
> Best
> 
> 


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
Spark on Yarn supports dynamic resource allocation

So, you can run several spark-shells / spark-submits / spark-jobserver /
zeppelin on one cluster without defining upfront how many executors /
memory you want to allocate to each app

Great feature for regular users who just want to run Spark / Spark SQL
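
For reference, a minimal sketch of the settings this relies on (example values;
on YARN the external shuffle service must also be enabled on the NodeManagers):

val conf = new org.apache.spark.SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")    // example bounds
  .set("spark.dynamicAllocation.maxExecutors", "20")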


On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen  wrote:

> I don't think usage is the differentiating factor. YARN and standalone
> are pretty well supported. If you are only running a Spark cluster by
> itself with nothing else, standalone is probably simpler than setting
> up YARN just for Spark. However if you're running on a cluster that
> will host other applications, you'll need to integrate with a shared
> resource manager and its security model, and for anything
> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>
> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>  wrote:
> > AWS EMR includes Spark on Yarn
> > Hortonworks and Cloudera platforms include Spark on Yarn as well
> >
> >
> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
> arkadiusz.b...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
> >> production ?
> >>
> >> I would like to choose most supported and used technology in
> >> production for our project.
> >>
> >>
> >> BR,
> >>
> >> Arkadiusz Bicz
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hello,
Well, it sounds like Andy is implying that Spark can replace Hadoop, whereas Mich
still believes that HDFS is a keeper?
thanks

 

On Thursday, 14 April 2016, 20:40, David Newberger 
 wrote:
 

Can we assume your question is "Will Spark replace Hadoop MapReduce?" or do you
literally mean replacing the whole of Hadoop?

David

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop

Hi,

I hear that some are saying that Hadoop is getting old and out of date and will
be replaced by Spark!

Does this make sense and if so how accurate is it?

Best

  

Re: Spark replacing Hadoop

2016-04-14 Thread Felipe Gustavo
Hi Ashok,

In my opinion, we should look at Hadoop as a general purpose Framework that
supports multiple models and we should look at Spark as an alternative to
Hadoop MapReduce rather than a replacement to Hadoop ecosystem (for
instance, Spark is not replacing Zookeper, HDFS, etc)

Regards

On Thu, Apr 14, 2016 at 4:22 PM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Ashok
>
> In general if I was starting a new project and had not invested heavily in
> hadoop (i.e. Had a large staff that was trained on hadoop, had a lot of
> existing projects implemented on hadoop, …) I would probably start using
> spark. Its faster and easier to use
>
> Your mileage may vary
>
> Andy
>
> From: Ashok Kumar 
> Reply-To: Ashok Kumar 
> Date: Thursday, April 14, 2016 at 12:13 PM
> To: "user @spark" 
> Subject: Spark replacing Hadoop
>
> Hi,
>
> I hear that some saying that Hadoop is getting old and out of date and
> will be replaced by Spark!
>
> Does this make sense and if so how accurate is it?
>
> Best
>
>


RE: Spark replacing Hadoop

2016-04-14 Thread David Newberger
Can we assume your question is “Will Spark replace Hadoop MapReduce?” or do you 
literally mean replacing the whole of Hadoop?

David

From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID]
Sent: Thursday, April 14, 2016 2:13 PM
To: User
Subject: Spark replacing Hadoop

Hi,

I hear that some saying that Hadoop is getting old and out of date and will be 
replaced by Spark!

Does this make sense and if so how accurate is it?

Best


Re: Spark replacing Hadoop

2016-04-14 Thread Mich Talebzadeh
Hi,

My two cents here.

Hadoop as I understand has two components namely HDFS (Hadoop Distributed
File System) and MapReduce.

Whatever we use, I still think we need to store data on HDFS (excluding
standalone stores like MongoDB etc.). Now moving to MapReduce: as the execution
engine it is replaced by Tez (basically MapReduce with DAG) or by Spark, which
uses in-memory capabilities and DAG. MapReduce is the one moving sideways.

To me Spark, besides being versatile, is a powerful tool. Remember tools are
just tools, not solutions, so we can discuss this all day. Effectively I would
argue that with Spark as the front-end tool, Hive and its organisation for
metadata, plus HDFS as the storage layer, you have all three components to
create a powerful solution.

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 20:22, Andy Davidson 
wrote:

> Hi Ashok
>
> In general if I was starting a new project and had not invested heavily in
> hadoop (i.e. Had a large staff that was trained on hadoop, had a lot of
> existing projects implemented on hadoop, …) I would probably start using
> spark. Its faster and easier to use
>
> Your mileage may vary
>
> Andy
>
> From: Ashok Kumar 
> Reply-To: Ashok Kumar 
> Date: Thursday, April 14, 2016 at 12:13 PM
> To: "user @spark" 
> Subject: Spark replacing Hadoop
>
> Hi,
>
> I hear that some saying that Hadoop is getting old and out of date and
> will be replaced by Spark!
>
> Does this make sense and if so how accurate is it?
>
> Best
>
>


Re: Spark replacing Hadoop

2016-04-14 Thread Andy Davidson
Hi Ashok

In general, if I was starting a new project and had not invested heavily in
Hadoop (i.e. had a large staff that was trained on Hadoop, had a lot of
existing projects implemented on Hadoop, …) I would probably start using
Spark. It's faster and easier to use.

Your mileage may vary

Andy

From:  Ashok Kumar 
Reply-To:  Ashok Kumar 
Date:  Thursday, April 14, 2016 at 12:13 PM
To:  "user @spark" 
Subject:  Spark replacing Hadoop

> Hi,
> 
> I hear that some saying that Hadoop is getting old and out of date and will be
> replaced by Spark!
> 
> Does this make sense and if so how accurate is it?
> 
> Best




Spark replacing Hadoop

2016-04-14 Thread Ashok Kumar
Hi,
I hear that some are saying that Hadoop is getting old and out of date and will be
replaced by Spark!
Does this make sense and if so how accurate is it?
Best

Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Sean Owen
I don't think usage is the differentiating factor. YARN and standalone
are pretty well supported. If you are only running a Spark cluster by
itself with nothing else, standalone is probably simpler than setting
up YARN just for Spark. However if you're running on a cluster that
will host other applications, you'll need to integrate with a shared
resource manager and its security model, and for anything
Hadoop-related that's YARN. Standalone wouldn't make as much sense.

On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
 wrote:
> AWS EMR includes Spark on Yarn
> Hortonworks and Cloudera platforms include Spark on Yarn as well
>
>
> On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz 
> wrote:
>>
>> Hello,
>>
>> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> production ?
>>
>> I would like to choose most supported and used technology in
>> production for our project.
>>
>>
>> BR,
>>
>> Arkadiusz Bicz
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>




JSON Usage

2016-04-14 Thread Benjamin Kim
I was wondering what would be the best way to use JSON in Spark/Scala. I need to 
look up the values of fields in a collection of records to form a URL and download 
the file at that location. I was thinking an RDD would be perfect for this. I 
just want to hear from others who might have more experience in this. Below is 
the actual JSON structure from which I am trying to use the S3 bucket and key 
values of each "record" within "Records".

{
   "Records":[
      {
         "eventVersion":"2.0",
         "eventSource":"aws:s3",
         "awsRegion":"us-east-1",
         "eventTime":The time, in ISO-8601 format, for example, 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
         "eventName":"event-type",
         "userIdentity":{
            "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
         },
         "requestParameters":{
            "sourceIPAddress":"ip-address-where-request-came-from"
         },
         "responseElements":{
            "x-amz-request-id":"Amazon S3 generated request ID",
            "x-amz-id-2":"Amazon S3 host that processed the request"
         },
         "s3":{
            "s3SchemaVersion":"1.0",
            "configurationId":"ID found in the bucket notification configuration",
            "bucket":{
               "name":"bucket-name",
               "ownerIdentity":{
                  "principalId":"Amazon-customer-ID-of-the-bucket-owner"
               },
               "arn":"bucket-ARN"
            },
            "object":{
               "key":"object-key",
               "size":object-size,
               "eTag":"object eTag",
               "versionId":"object version if bucket is versioning-enabled, otherwise null",
               "sequencer":"a string representation of a hexadecimal value used to determine event sequence, only used with PUTs and DELETEs"
            }
         }
      },
      {
         // Additional events
      }
   ]
}
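
For what it's worth, here is one hedged sketch (Spark 1.x DataFrames in the
spark-shell; the HDFS path and the URL format are assumptions) of pulling the
bucket/key pairs out of records shaped like the above:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.explode

// Assumes a running spark-shell, so sc already exists.
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val events = sqlContext.read.json("hdfs:///path/to/s3-event-notifications")

// "Records" is an array, so explode it into one row per record and pull out
// the bucket name and object key from the structure above.
val records = events
  .select(explode($"Records").as("r"))
  .select($"r.s3.bucket.name".as("bucket"), $"r.s3.object.key".as("key"))

// In Spark 1.x, DataFrame.map yields an RDD; build one URL string per record.
val urls = records.map(row => s"https://${row.getString(0)}.s3.amazonaws.com/${row.getString(1)}")
urls.take(5).foreach(println)

An RDD of raw JSON strings plus a JSON library would work as well; the DataFrame
route just saves writing the field extraction by hand.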

Thanks
Ben



Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Mich Talebzadeh
Hi Josh,

Can you please clarify whether date comparisons as two strings work at all?

I was under the impression that with string comparison only the first
characters are compared?

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 19:26, Josh Rosen  wrote:

> AFAIK this is not being pushed down because it involves an implicit cast
> and we currently don't push casts into data sources or scans; see
> https://github.com/databricks/spark-redshift/issues/155 for a
> possibly-related discussion.
>
> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Are you comparing strings in here or timestamp?
>>
>> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
>> (cast(registration#37 as string) <= 2015-05-29))
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 14 April 2016 at 18:04, Kiran Chitturi 
>> wrote:
>>
>>> Hi,
>>>
>>> Timestamp range filter queries in SQL are not getting pushed down to the
>>> PrunedFilteredScan instances. The filtering is happening at the Spark layer.
>>>
>>> The physical plan for timestamp range queries is not showing the pushed
>>> filters where as range queries on other types is working fine as the
>>> physical plan is showing the pushed filters.
>>>
>>> Please see below for code and examples.
>>>
>>> *Example:*
>>>
>>> *1.* Range filter queries on Timestamp types
>>>
>>>*code: *
>>>
 sqlContext.sql("SELECT * from events WHERE `registration` >=
 '2015-05-28' AND `registration` <= '2015-05-29' ")
>>>
>>>*Full example*:
>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>> *plan*:
>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>>
>>> *2. * Range filter queries on Long types
>>>
>>> *code*:
>>>
 sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
 `length` <= '1000'")
>>>
>>> *Full example*:
>>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>>> *plan*:
>>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>>
>>> The SolrRelation class we use extends
>>> 
>>> the PrunedFilteredScan.
>>>
>>> Since Solr supports date ranges, I would like for the timestamp filters
>>> to be pushed down to the Solr query.
>>>
>>> Are there limitations on the type of filters that are passed down with
>>> Timestamp types ?
>>> Is there something that I should do in my code to fix this ?
>>>
>>> Thanks,
>>> --
>>> Kiran Chitturi
>>>
>>>
>>


Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Josh Rosen
AFAIK this is not being pushed down because it involves an implicit cast
and we currently don't push casts into data sources or scans; see
https://github.com/databricks/spark-redshift/issues/155 for a
possibly-related discussion.
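
One possible workaround sketch (untested against spark-solr; table and column
names taken from the mails below): comparing against typed timestamp literals
through the DataFrame API keeps the cast off the column, so the predicate at
least has a chance of being offered to the source as a plain range filter.

import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, lit}

// Typed Timestamp literals: the column is not wrapped in cast(... as string),
// so the predicate can surface as GreaterThanOrEqual/LessThanOrEqual filters.
val start = lit(Timestamp.valueOf("2015-05-28 00:00:00"))
val end   = lit(Timestamp.valueOf("2015-05-29 00:00:00"))

val filtered = sqlContext.table("events")
  .filter(col("registration") >= start && col("registration") <= end)
filtered.explain()  // check the PushedFilters section of the physical plan

Whether the source actually receives the filter still depends on the data
source implementation.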

On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh 
wrote:

> Are you comparing strings in here or timestamp?
>
> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
> (cast(registration#37 as string) <= 2015-05-29))
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 April 2016 at 18:04, Kiran Chitturi 
> wrote:
>
>> Hi,
>>
>> Timestamp range filter queries in SQL are not getting pushed down to the
>> PrunedFilteredScan instances. The filtering is happening at the Spark layer.
>>
>> The physical plan for timestamp range queries is not showing the pushed
>> filters where as range queries on other types is working fine as the
>> physical plan is showing the pushed filters.
>>
>> Please see below for code and examples.
>>
>> *Example:*
>>
>> *1.* Range filter queries on Timestamp types
>>
>>*code: *
>>
>>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>>
>>*Full example*:
>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>> *plan*:
>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>>
>> *2. * Range filter queries on Long types
>>
>> *code*:
>>
>>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>>> `length` <= '1000'")
>>
>> *Full example*:
>> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
>> *plan*:
>> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>>
>> The SolrRelation class we use extends
>> 
>> the PrunedFilteredScan.
>>
>> Since Solr supports date ranges, I would like for the timestamp filters
>> to be pushed down to the Solr query.
>>
>> Are there limitations on the type of filters that are passed down with
>> Timestamp types ?
>> Is there something that I should do in my code to fix this ?
>>
>> Thanks,
>> --
>> Kiran Chitturi
>>
>>
>


Re: EMR Spark log4j and metrics

2016-04-14 Thread Peter Halliday
An update to this: I can see the log4j.properties and metrics.properties files
correctly on the master.  When I submit a Spark step that runs Spark in cluster
deploy mode, I see the config files being zipped up and pushed via HDFS to the
driver and workers.  However, I don't see evidence that the configuration files
are read or used after they are pushed.

On Wed, Apr 13, 2016 at 11:22 AM, Peter Halliday  wrote:

> I have an existing cluster that I stand up via Docker images and
> CloudFormation Templates  on AWS.  We are moving to EMR and AWS Data
> Pipeline process, and having problems with metrics and log4j.  We’ve sent a
> JSON configuration for spark-log4j and spark-metrics.  The log4j file seems
> to be basically working for the master.  However, the driver and executors
> it isn’t working for.  I’m not sure why.  Also, the metrics aren’t working
> anywhere. It’s using a cloud watch to log the metrics, and there’s no
> CloudWatch Sink for Spark it seems on EMR, and so we created one that we
> added to a jar than’s sent via —jars to spark-submit.
>
> Peter Halliday


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mich Talebzadeh
Hi Alex,

Do you mean using Spark with Yarn-client compared to using Spark Local?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 18:46, Alexander Pivovarov  wrote:

> AWS EMR includes Spark on Yarn
> Hortonworks and Cloudera platforms include Spark on Yarn as well
>
>
> On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz 
> wrote:
>
>> Hello,
>>
>> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> production ?
>>
>> I would like to choose most supported and used technology in
>> production for our project.
>>
>>
>> BR,
>>
>> Arkadiusz Bicz
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
AWS EMR includes Spark on Yarn
Hortonworks and Cloudera platforms include Spark on Yarn as well


On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz 
wrote:

> Hello,
>
> Is there any statistics regarding YARN vs Standalone Spark Usage in
> production ?
>
> I would like to choose most supported and used technology in
> production for our project.
>
>
> BR,
>
> Arkadiusz Bicz
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Mich Talebzadeh
Are you comparing strings in here or timestamp?

Filter ((cast(registration#37 as string) >= 2015-05-28) &&
(cast(registration#37 as string) <= 2015-05-29))


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 18:04, Kiran Chitturi 
wrote:

> Hi,
>
> Timestamp range filter queries in SQL are not getting pushed down to the
> PrunedFilteredScan instances. The filtering is happening at the Spark layer.
>
> The physical plan for timestamp range queries is not showing the pushed
> filters where as range queries on other types is working fine as the
> physical plan is showing the pushed filters.
>
> Please see below for code and examples.
>
> *Example:*
>
> *1.* Range filter queries on Timestamp types
>
>*code: *
>
>> sqlContext.sql("SELECT * from events WHERE `registration` >=
>> '2015-05-28' AND `registration` <= '2015-05-29' ")
>
>*Full example*:
> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
> *plan*:
> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql
>
> *2. * Range filter queries on Long types
>
> *code*:
>
>> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
>> `length` <= '1000'")
>
> *Full example*:
> https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
> *plan*:
> https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql
>
> The SolrRelation class we use extends
> 
> the PrunedFilteredScan.
>
> Since Solr supports date ranges, I would like for the timestamp filters to
> be pushed down to the Solr query.
>
> Are there limitations on the type of filters that are passed down with
> Timestamp types ?
> Is there something that I should do in my code to fix this ?
>
> Thanks,
> --
> Kiran Chitturi
>
>


Spark/Parquet

2016-04-14 Thread Younes Naguib
Hi all,

When is Parquet 2.0 planned in Spark?
Or is it already supported?


Younes Naguib
Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | 
younes.nag...@tritondigital.com 



Spark sql not pushing down timestamp range queries

2016-04-14 Thread Kiran Chitturi
Hi,

Timestamp range filter queries in SQL are not getting pushed down to the
PrunedFilteredScan instances. The filtering is happening at the Spark layer.

The physical plan for timestamp range queries is not showing the pushed
filters, whereas range queries on other types work fine and the
physical plan shows the pushed filters.

Please see below for code and examples.

*Example:*

*1.* Range filter queries on Timestamp types

   *code: *

> sqlContext.sql("SELECT * from events WHERE `registration` >= '2015-05-28'
> AND `registration` <= '2015-05-29' ")

   *Full example*:
https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
*plan*:
https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql

*2. * Range filter queries on Long types

*code*:

> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and `length`
> <= '1000'")

*Full example*:
https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
*plan*:
https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql

The SolrRelation class we use extends

the PrunedFilteredScan.

Since Solr supports date ranges, I would like for the timestamp filters to
be pushed down to the Solr query.

Are there limitations on the type of filters that are passed down with
Timestamp types ?
Is there something that I should do in my code to fix this ?

Thanks,
-- 
Kiran Chitturi


Re: Sqoop on Spark

2016-04-14 Thread Mich Talebzadeh
Hi,

"SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in
4 threads, that is 246 GB of data."

Could you please give the source of the database and where was it (on the
same host as Hive or another host).

thanks


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 16:31, Jörn Franke  wrote:

> They wanted to have alternatives. I recommended the original approach of
> simply using sqoop.
>
> On 14 Apr 2016, at 16:09, Gourav Sengupta 
> wrote:
>
> Hi,
>
> SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in
> 4 threads, that is 246 GB of data.
>
> Why is the discussion about using anything other than SQOOP still so
> wonderfully on?
>
>
> Regards,
> Gourav
>
> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke  wrote:
>
>> Actually I was referring to have a an external table in Oracle, which is
>> used to export to CSV (insert into). Then you have a csv on the database
>> server which needs to be moved to HDFS.
>>
>> On 11 Apr 2016, at 17:50, Michael Segel 
>> wrote:
>>
>> Depending on the Oracle release…
>>
>> You could use webHDFS to gain access to the cluster and see the CSV file
>> as an external table.
>>
>> However, you would need to have an application that will read each block
>> of the file in parallel. This works for loading in to the RDBMS itself.
>> Actually you could use sqoop in reverse to push data to the RDBMS provided
>> that the block file is splittable.  This is a classic M/R problem.
>>
>> But I don’t think this is what the OP wants to do. They want to pull data
>> from the RDBMs. If you could drop the table’s underlying file and can read
>> directly from it… you can do a very simple bulk load/unload process.
>> However you need to know the file’s format.
>>
>> Not sure what IBM or Oracle has done to tie their RDBMs to Big Data.
>>
>> As I and other posters to this thread have alluded to… this would be a
>> block bulk load/unload tool.
>>
>>
>> On Apr 10, 2016, at 11:31 AM, Jörn Franke  wrote:
>>
>>
>> I am not 100% sure, but you could export to CSV in Oracle using external
>> tables.
>>
>> Oracle has also the Hadoop Loader, which seems to support Avro. However,
>> I think you need to buy the Big Data solution.
>>
>> On 10 Apr 2016, at 16:12, Mich Talebzadeh 
>> wrote:
>>
>> Yes I meant MR.
>>
>> Again one cannot beat the RDBMS export utility. I was specifically
>> referring to Oracle in above case that does not provide any specific text
>> bases export except the binary one Exp, data pump etc).
>>
>> In case of SAPO ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy)
>> that can be parallelised either through range partitioning or simple round
>> robin partitioning that can be used to get data out to file in parallel.
>> Then once get data into Hive table through import etc.
>>
>> In general if the source table is very large you can used either SAP
>> Replication Server (SRS) or Oracle Golden Gate to get data to Hive. Both
>> these replication tools provide connectors to Hive and they do a good job.
>> If one has something like Oracle in Prod then there is likely a Golden Gate
>> there. For bulk setting of Hive tables and data migration, replication
>> server is good option.
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 10 April 2016 at 14:24, Michael Segel 
>> wrote:
>>
>>> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce)
>>>
>>> The largest problem with sqoop is that in order to gain parallelism you
>>> need to know how your underlying table is partitioned and to do multiple
>>> range queries. This may not be known, or your data may or may not be
>>> equally distributed across the ranges.
>>>
>>> If you’re bringing over the entire table, you may find dropping it and
>>> then moving it to HDFS and then doing a bulk load to be more efficient.
>>> (This is less flexible than sqoop, but also stresses the database
>>> servers less. )
>>>
>>> Again, YMMV
>>>
>>>
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh 
>>> wrote:
>>>
>>> Well unless you have plenty of memory, you are going to have certain
>>> issues with Spark.
>>>
>>> I tried to load a billion rows table from oracle through spark using
>>> JDBC and ended up with "Caused by: java.lang.OutOfMemoryError: Java heap
>>> space" error.
>>>
>>> Sqoop uses MapR and does it in serial mode which takes time and you can
>>> also tell it to 

Re: [ERROR]: Spark 1.5.2 + Hbase 1.1 + Hive 1.2 + HbaseIntegration

2016-04-14 Thread Teng Qiu
Forwarding you these mails, hope they can help you. You can take a look
at this post:
http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html

2016-03-04 3:30 GMT+01:00 Divya Gehlot :
> Hi Teng,
>
> Thanks for the link you shared , helped me figure out the missing
> dependency.
> Was missing hbase-hadoop-compat.jar
>
>
>
>
>
> Thanks a lot,
>
> Divya
>
> On 2 March 2016 at 17:05, Teng Qiu  wrote:
>>
>> Hi, maybe the dependencies described in
>> http://www.abcn.net/2014/07/lighting-spark-with-hbase-full-edition.html
>> can help, add hive-hbase handler jar as well for HiveIntegration in
>> spark
>>
>> 2016-03-02 2:19 GMT+01:00 Divya Gehlot :
>> > Hello Teng,
>> > As you could see in chain email.
>> > I am facing lots of  issues while trying to connect to hbase  registered
>> > hive table.
>> > Could your pls help me with the list of jars which needs to be place in
>> > spark classpath?
>> > Would be very grateful you could send me the steps to follow .
>> > Would really appreciate the help.
>> > Thanks,
>> > Divya
>> >
>> > On Mar 2, 2016 4:50 AM, "Teng Qiu"  wrote:
>> >>
>> >> and also make sure that hbase-site.xml is set in your classpath on all
>> >> nodes, both master and workers, and also client.
>> >>
>> >> normally i put it into $SPARK_HOME/conf/ then the spark cluster will
>> >> be started with this conf file.
>> >>
>> >> btw. @Ted, did you tried insert into hbase table with spark's
>> >> HiveContext? i got this issue:
>> >> https://issues.apache.org/jira/browse/SPARK-6628
>> >>
>> >> and there is a patch available:
>> >> https://issues.apache.org/jira/browse/HIVE-11166
>> >>
>> >>
>> >> 2016-03-01 15:16 GMT+01:00 Ted Yu :
>> >> > 16/03/01 01:36:31 WARN TaskSetManager: Lost task 0.0 in stage 0.0
>> >> > (TID
>> >> > 0,
>> >> > ip-xxx-xx-xx-xxx.ap-southeast-1.compute.internal):
>> >> > java.lang.RuntimeException: hbase-default.xml file seems to be for an
>> >> > older
>> >> > version of HBase (null), this version is 1.1.2.2.3.4.0-3485
>> >> >
>> >> > The above was likely caused by some component being built with
>> >> > different
>> >> > release of hbase.
>> >> >
>> >> > Try setting "hbase.defaults.for.version.skip" to true.
>> >> >
>> >> > Cheers
>> >> >
>> >> >
>> >> > On Mon, Feb 29, 2016 at 9:12 PM, Ted Yu  wrote:
>> >> >>
>> >> >> 16/02/29 23:09:34 INFO ZooKeeper: Initiating client connection,
>> >> >> connectString=localhost:2181 sessionTimeout=9
>> >> >> watcher=hconnection-0x26fa89a20x0, quorum=localhost:2181,
>> >> >> baseZNode=/hbase
>> >> >>
>> >> >> Since baseZNode didn't match what you set in hbase-site.xml, the
>> >> >> cause
>> >> >> was
>> >> >> likely that hbase-site.xml being inaccessible to your Spark job.
>> >> >>
>> >> >> Please add it in your classpath.
>> >> >>
>> >> >> On Mon, Feb 29, 2016 at 8:42 PM, Ted Yu  wrote:
>> >> >>>
>> >> >>> 16/02/29 23:09:34 INFO ClientCnxn: Opening socket connection to
>> >> >>> server
>> >> >>> localhost/0:0:0:0:0:0:0:1:2181. Will not attempt to authenticate
>> >> >>> using
>> >> >>> SASL
>> >> >>> (unknown error)
>> >> >>>
>> >> >>> Is your cluster secure cluster ?
>> >> >>>
>> >> >>> bq. Trace :
>> >> >>>
>> >> >>> Was there any output after 'Trace :' ?
>> >> >>>
>> >> >>> Was hbase-site.xml accessible to your Spark job ?
>> >> >>>
>> >> >>> Thanks
>> >> >>>
>> >> >>> On Mon, Feb 29, 2016 at 8:27 PM, Divya Gehlot
>> >> >>> 
>> >> >>> wrote:
>> >> 
>> >>  Hi,
>> >>  I am getting error when I am trying to connect hive table (which
>> >>  is
>> >>  being created through HbaseIntegration) in spark
>> >> 
>> >>  Steps I followed :
>> >>  Hive Table creation code  :
>> >>  CREATE EXTERNAL TABLE IF NOT EXISTS TEST(NAME STRING,AGE INT)
>> >>  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>> >>  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,0:AGE")
>> >>  TBLPROPERTIES ("hbase.table.name" = "TEST",
>> >>  "hbase.mapred.output.outputtable" = "TEST");
>> >> 
>> >> 
>> >>  DESCRIBE TEST ;
>> >>  col_namedata_typecomment
>> >>  namestring from deserializer
>> >>  age   int from deserializer
>> >> 
>> >> 
>> >>  Spark Code :
>> >>  import org.apache.spark._
>> >>  import org.apache.spark.sql._
>> >> 
>> >>  val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>> >>  hiveContext.sql("from TEST SELECT  NAME").collect.foreach(println)
>> >> 
>> >> 
>> >>  Starting Spark shell
>> >>  spark-shell --jars
>> >> 
>> >> 
>> >>  

Re: Sqoop on Spark

2016-04-14 Thread Jörn Franke
They wanted to have alternatives. I recommended the original approach of simply 
using sqoop.

> On 14 Apr 2016, at 16:09, Gourav Sengupta  wrote:
> 
> Hi,
> 
> SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in 4 
> threads, that is 246 GB of data.
> 
> Why is the discussion about using anything other than SQOOP still so 
> wonderfully on?
> 
> 
> Regards,
> Gourav
> 
>> On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke  wrote:
>> Actually I was referring to have a an external table in Oracle, which is 
>> used to export to CSV (insert into). Then you have a csv on the database 
>> server which needs to be moved to HDFS.
>> 
>>> On 11 Apr 2016, at 17:50, Michael Segel  wrote:
>>> 
>>> Depending on the Oracle release… 
>>> 
>>> You could use webHDFS to gain access to the cluster and see the CSV file as 
>>> an external table. 
>>> 
>>> However, you would need to have an application that will read each block of 
>>> the file in parallel. This works for loading in to the RDBMS itself.  
>>> Actually you could use sqoop in reverse to push data to the RDBMS provided 
>>> that the block file is splittable.  This is a classic M/R problem. 
>>> 
>>> But I don’t think this is what the OP wants to do. They want to pull data 
>>> from the RDBMs. If you could drop the table’s underlying file and can read 
>>> directly from it… you can do a very simple bulk load/unload process. 
>>> However you need to know the file’s format. 
>>> 
>>> Not sure what IBM or Oracle has done to tie their RDBMs to Big Data. 
>>> 
>>> As I and other posters to this thread have alluded to… this would be a 
>>> block bulk load/unload tool. 
>>> 
>>> 
 On Apr 10, 2016, at 11:31 AM, Jörn Franke  wrote:
 
 
 I am not 100% sure, but you could export to CSV in Oracle using external 
 tables.
 
 Oracle has also the Hadoop Loader, which seems to support Avro. However, I 
 think you need to buy the Big Data solution.
 
> On 10 Apr 2016, at 16:12, Mich Talebzadeh  
> wrote:
> 
> Yes I meant MR.
> 
> Again one cannot beat the RDBMS export utility. I was specifically 
> referring to Oracle in above case that does not provide any specific text 
> bases export except the binary one Exp, data pump etc).
> 
> In case of SAPO ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy) 
> that can be parallelised either through range partitioning or simple 
> round robin partitioning that can be used to get data out to file in 
> parallel. Then once get data into Hive table through import etc.
> 
> In general if the source table is very large you can used either SAP 
> Replication Server (SRS) or Oracle Golden Gate to get data to Hive. Both 
> these replication tools provide connectors to Hive and they do a good 
> job. If one has something like Oracle in Prod then there is likely a 
> Golden Gate there. For bulk setting of Hive tables and data migration, 
> replication server is good option.
> 
> HTH
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> 
>> On 10 April 2016 at 14:24, Michael Segel  
>> wrote:
>> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce) 
>> 
>> The largest problem with sqoop is that in order to gain parallelism you 
>> need to know how your underlying table is partitioned and to do multiple 
>> range queries. This may not be known, or your data may or may not be 
>> equally distributed across the ranges.  
>> 
>> If you’re bringing over the entire table, you may find dropping it and 
>> then moving it to HDFS and then doing a bulk load to be more efficient.
>> (This is less flexible than sqoop, but also stresses the database 
>> servers less. ) 
>> 
>> Again, YMMV
>> 
>> 
>>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh  
>>> wrote:
>>> 
>>> Well unless you have plenty of memory, you are going to have certain 
>>> issues with Spark.
>>> 
>>> I tried to load a billion rows table from oracle through spark using 
>>> JDBC and ended up with "Caused by: java.lang.OutOfMemoryError: Java 
>>> heap space" error.
>>> 
>>> Sqoop uses MapR and does it in serial mode which takes time and you can 
>>> also tell it to create Hive table. However, it will import data into 
>>> Hive table.
>>> 
>>> In any case the mechanism of data import is through JDBC, Spark uses 
>>> memory and DAG, whereas Sqoop relies on MapR.
>>> 
>>> There is of course another alternative.
>>> 
>>> Assuming that your Oracle table has a primary 

Exposing temp table via Hive Thrift server

2016-04-14 Thread ram kumar
Hi,

In spark-shell (Scala), we import
*org.apache.spark.sql.hive.thriftserver._*
to start the Hive Thrift server programmatically for a particular HiveContext,
calling
*HiveThriftServer2.startWithContext(hiveContext)*
to expose the registered temp tables for that particular session.
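
For reference, a minimal Scala sketch of that flow (the path and table name are
made up):

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

// Assumes an existing SparkContext (sc).
val hiveContext = new HiveContext(sc)
hiveContext.read.json("hdfs:///path/to/events.json").registerTempTable("events")

// Expose the temp tables registered on this HiveContext over JDBC/ODBC.
HiveThriftServer2.startWithContext(hiveContext)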

We used PySpark for creating the DataFrame.
Is there a package in Python for importing HiveThriftServer2?

Thanks


YARN vs Standalone Spark Usage in production

2016-04-14 Thread Arkadiusz Bicz
Hello,

Are there any statistics regarding YARN vs standalone Spark usage in
production?

I would like to choose the most supported and most used technology in
production for our project.


BR,

Arkadiusz Bicz




RE: Strange bug: Filter problem with parenthesis

2016-04-14 Thread Saif.A.Ellafi
Appreciated, Michael, but this doesn't help my case: the filter string is being 
submitted from outside my program. Is there any other alternative? Some literal 
string parser or anything I can do beforehand?

Saif

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, April 13, 2016 6:29 PM
To: Ellafi, Saif A.
Cc: user
Subject: Re: Strange bug: Filter problem with parenthesis

You need to use `backticks` to reference columns that have non-standard 
characters.
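
For illustration, a rough sketch of the backtick form, plus one hedged idea
(helper name is hypothetical) for pre-escaping a column name that arrives from
outside the program:

// Backticks make Spark SQL treat "sum(OpenAccounts)" as one column name
// rather than a call to sum(); df is the aggregated DataFrame from the report.
df.filter("`sum(OpenAccounts)` > 5").show()

// Hypothetical helper: wrap externally supplied column names in backticks
// before splicing them into the filter expression.
def quoteColumn(name: String): String = "`" + name.replace("`", "``") + "`"
val externalColumn = "sum(OpenAccounts)"
df.filter(s"${quoteColumn(externalColumn)} > 5").show()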

On Wed, Apr 13, 2016 at 6:56 AM, 
> wrote:
Hi,

I am debugging a program, and for some reason, a line calling the following is 
failing:

df.filter("sum(OpenAccounts) > 5").show

It says it cannot find the column OpenAccounts, as if it were applying the sum()
function and looking for a column with that name, which does not exist. This
works fine if I rename the column to something without parentheses.

I can't reproduce this issue in the Spark shell (1.6.0); any ideas on how I can
analyze this? This is an aggregation result, with the default column names
afterwards.

PS: Workaround is to use toDF(cols) and rename all columns, but I am wondering 
if toDF has any impact on the RDD structure behind (e.g. repartitioning, cache, 
etc)

Appreciated,
Saif




Re: Sqoop on Spark

2016-04-14 Thread Gourav Sengupta
Hi,

SQOOP just extracted for me 1,253,015,160 records in 30 minutes running in
4 threads, that is 246 GB of data.

Why is the discussion about using anything other than SQOOP still so
wonderfully on?


Regards,
Gourav

On Mon, Apr 11, 2016 at 6:26 PM, Jörn Franke  wrote:

> Actually I was referring to have a an external table in Oracle, which is
> used to export to CSV (insert into). Then you have a csv on the database
> server which needs to be moved to HDFS.
>
> On 11 Apr 2016, at 17:50, Michael Segel  wrote:
>
> Depending on the Oracle release…
>
> You could use webHDFS to gain access to the cluster and see the CSV file
> as an external table.
>
> However, you would need to have an application that will read each block
> of the file in parallel. This works for loading in to the RDBMS itself.
> Actually you could use sqoop in reverse to push data to the RDBMS provided
> that the block file is splittable.  This is a classic M/R problem.
>
> But I don’t think this is what the OP wants to do. They want to pull data
> from the RDBMs. If you could drop the table’s underlying file and can read
> directly from it… you can do a very simple bulk load/unload process.
> However you need to know the file’s format.
>
> Not sure what IBM or Oracle has done to tie their RDBMs to Big Data.
>
> As I and other posters to this thread have alluded to… this would be a
> block bulk load/unload tool.
>
>
> On Apr 10, 2016, at 11:31 AM, Jörn Franke  wrote:
>
>
> I am not 100% sure, but you could export to CSV in Oracle using external
> tables.
>
> Oracle has also the Hadoop Loader, which seems to support Avro. However, I
> think you need to buy the Big Data solution.
>
> On 10 Apr 2016, at 16:12, Mich Talebzadeh 
> wrote:
>
> Yes I meant MR.
>
> Again one cannot beat the RDBMS export utility. I was specifically
> referring to Oracle in above case that does not provide any specific text
> bases export except the binary one Exp, data pump etc).
>
> In case of SAPO ASE, Sybase IQ, and MSSQL, one can use BCP (bulk copy)
> that can be parallelised either through range partitioning or simple round
> robin partitioning that can be used to get data out to file in parallel.
> Then once get data into Hive table through import etc.
>
> In general if the source table is very large you can used either SAP
> Replication Server (SRS) or Oracle Golden Gate to get data to Hive. Both
> these replication tools provide connectors to Hive and they do a good job.
> If one has something like Oracle in Prod then there is likely a Golden Gate
> there. For bulk setting of Hive tables and data migration, replication
> server is good option.
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 10 April 2016 at 14:24, Michael Segel 
> wrote:
>
>> Sqoop doesn’t use MapR… unless you meant to say M/R (Map Reduce)
>>
>> The largest problem with sqoop is that in order to gain parallelism you
>> need to know how your underlying table is partitioned and to do multiple
>> range queries. This may not be known, or your data may or may not be
>> equally distributed across the ranges.
>>
>> If you’re bringing over the entire table, you may find dropping it and
>> then moving it to HDFS and then doing a bulk load to be more efficient.
>> (This is less flexible than sqoop, but also stresses the database servers
>> less. )
>>
>> Again, YMMV
>>
>>
>> On Apr 8, 2016, at 9:17 AM, Mich Talebzadeh 
>> wrote:
>>
>> Well unless you have plenty of memory, you are going to have certain
>> issues with Spark.
>>
>> I tried to load a billion rows table from oracle through spark using JDBC
>> and ended up with "Caused by: java.lang.OutOfMemoryError: Java heap space"
>> error.
>>
>> Sqoop uses MapR and does it in serial mode which takes time and you can
>> also tell it to create Hive table. However, it will import data into Hive
>> table.
>>
>> In any case the mechanism of data import is through JDBC, Spark uses
>> memory and DAG, whereas Sqoop relies on MapR.
>>
>> There is of course another alternative.
>>
>> Assuming that your Oracle table has a primary Key say "ID" (it would be
>> easier if it was a monotonically increasing number) or already partitioned.
>>
>>
>>1. You can create views based on the range of ID or for each
>>partition. You can then SELECT COLUMNS  co1, col2, coln from view and 
>> spool
>>it to a text file on OS (locally say backup directory would be fastest).
>>2. bzip2 those files and scp them to a local directory in Hadoop
>>3. You can then use Spark/hive to load the target table from local
>>files in parallel
>>4. When creating views take care of NUMBER and CHAR 

Re: Spark Yarn closing sparkContext

2016-04-14 Thread Ted Yu
Can you pastebin the failure message ?

Did you happen to take jstack during the close ?

Which Hadoop version do you use ?

Thanks 

> On Apr 14, 2016, at 5:53 AM, nihed mbarek  wrote:
> 
> Hi, 
> I have an issue with closing my application context, the process take a long 
> time with a fail at the end. In other part, my result was generate in the 
> write folder and _SUCESS file was created. 
> I'm using spark 1.6 with yarn. 
> 
> any idea ? 
> 
> regards, 
> 
> -- 
> 
> MBAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
> 
> 
> 


Spark streaming applicaiton don't generate Jobs after run a week ,At last,it throw oom exeception

2016-04-14 Thread yuemeng (A)
@All

There is a strange problem. I have been running a Spark Streaming application for a
long time; the application does the following:


1)   Fetch data from Kafka using the direct API

2)   Use SQL to write the data of each RDD of the DStream into Redis

3)   Read data from Redis

Everything seemed OK for about a week. After one week, the application stopped
generating jobs; it only printed the following info in the driver log after the last
jobs were generated:


16/04/14 10:37:49 INFO JobScheduler: Added jobs for time 146060068 ms

16/04/14 10:37:51 INFO MetadataCleaner: Ran metadata cleaner for 
MAP_OUTPUT_TRACKER
16/04/14 10:37:51 INFO MetadataCleaner: Ran metadata cleaner for SPARK_CONTEXT
16/04/14 10:37:51 INFO BlockManager: Dropping non broadcast blocks older than 
1460601351512
16/04/14 10:37:51 INFO BlockManager: Dropping broadcast blocks older than 
1460601351512
16/04/14 10:37:51 INFO MetadataCleaner: Ran metadata cleaner for BROADCAST_VARS
16/04/14 10:37:51 INFO MetadataCleaner: Ran metadata cleaner for BLOCK_MANAGER
16/04/14 10:37:51 INFO SparkContext: Starting job: transform at 
BindCard.scala:44
16/04/14 10:38:03 INFO BlockManager: Dropping non broadcast blocks older than 
1460601363512
16/04/14 10:38:03 INFO MetadataCleaner: Ran metadata cleaner for BLOCK_MANAGER
16/04/14 10:38:03 INFO MetadataCleaner: Ran metadata cleaner for 
MAP_OUTPUT_TRACKER
16/04/14 10:38:03 INFO MetadataCleaner: Ran metadata cleaner for SPARK_CONTEXT
16/04/14 10:38:03 INFO BlockManager: Dropping broadcast blocks older than 
1460601363513
16/04/14 10:38:03 INFO MetadataCleaner: Ran metadata cleaner for BROADCAST_VARS
16/04/14 10:38:15 INFO BlockManager: Dropping non broadcast blocks older than 
1460601375512
16/04/14 10:38:15 INFO MetadataCleaner: Ran metadata cleaner for BLOCK_MANAGER
16/04/14 10:38:15 INFO MetadataCleaner: Ran metadata cleaner for 
MAP_OUTPUT_TRACKER
16/04/14 10:38:15 INFO MetadataCleaner: Ran metadata cleaner for SPARK_CONTEXT
16/04/14 10:38:15 INFO BlockManager: Dropping broadcast blocks older than 
1460601375513
16/04/14 10:38:15 INFO MetadataCleaner: Ran metadata cleaner for BROADCAST_VARS
16/04/14 10:38:27 INFO BlockManager: Dropping non broadcast blocks older than 
1460601387513

Has anyone met this problem?
Can anyone give me some advice on this issue, or any possible reasons?
I suspect the driver CPU is entirely consumed by full GC, leaving no time to
generate jobs.








Yue Meng (Rick)  00277916
Big Data Technology Development Dept.
中软 Big Data 3ms team: http://3ms.huawei.com/hi/group/2031037





Spark Yarn closing sparkContext

2016-04-14 Thread nihed mbarek
Hi,
I have an issue with closing my application context: the process takes a
long time and fails at the end. On the other hand, my result was generated in
the output folder and the _SUCCESS file was created.
I'm using Spark 1.6 with YARN.

any idea ?

regards,

-- 

MBAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Spark MLib LDA Example

2016-04-14 Thread Amit Singh Hora
Hi All,

I am very new to Spark MLlib. I am trying to understand and implement Spark
MLlib's LDA algorithm.
The goal is to get the topics present in the given documents and the terms
within those topics.
I followed the link below:
https://gist.github.com/jkbradley/ab8ae22a8282b2c8ce33

But getting output as
TOPIC 0
morality      0.05220243077220879
being         0.035021580374984436
omniscient    0.022428246152460637
islamic       0.021139857126802202
which         0.017421282572242652
natural       0.012911262664316678
about         0.01268297163653654
article       0.012466817422546324
keith         0.01246464564083541
california    0.01203631230812281

TOPIC 1
article       0.02052394395182315
someone       0.01579814589359546
different     0.014195697566496364
would         0.013759644157873953
human         0.013303732850358341
think         0.013203570748155018
could         0.01318633594470554
saying        0.011956765545346498
there         0.011669522102424768
which         0.011298125680292148

Now I don't understand how to get the actual text in place of these TOPIC 1 and
TOPIC 2 labels.
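
In case it helps, a small self-contained sketch (a made-up 5-term vocabulary and
three tiny documents) of mapping describeTopics output back to words, which is
essentially what the linked gist does with its vocabArray:

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vectors

// Tiny made-up corpus: three documents over a five-term vocabulary.
val vocabArray = Array("morality", "being", "article", "human", "think")
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(3.0, 1.0, 0.0, 0.0, 1.0)),
  (1L, Vectors.dense(0.0, 0.0, 2.0, 1.0, 3.0)),
  (2L, Vectors.dense(1.0, 0.0, 1.0, 2.0, 0.0))
))

val ldaModel = new LDA().setK(2).run(corpus)

// describeTopics returns, per topic, term indices and weights; mapping the
// indices through vocabArray gives the actual words.
ldaModel.describeTopics(maxTermsPerTopic = 3).zipWithIndex.foreach {
  case ((termIndices, weights), topicId) =>
    println(s"TOPIC $topicId")
    termIndices.zip(weights).foreach { case (i, w) => println(s"${vocabArray(i)}\t$w") }
}

// With the default EM optimizer the returned model is distributed and also
// exposes a per-document topic mixture (docId -> topic distribution vector).
val distModel = ldaModel.asInstanceOf[DistributedLDAModel]
distModel.topicDistributions.take(3).foreach(println)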







New syntax error in Spark with for loop

2016-04-14 Thread raghunathr85
I was using Spark 1.2.x earlier and my PySpark worked well with that version.


When I upgraded to Spark 1.5.0, I started getting a SyntaxError for the same code
which worked earlier. I don't know where the issue is.

.map(lambda (k,v): (k,list(set(v.map(lambda (k,v): (k,{v2:i for i, v2 in
enumerate(v)})).collectAsMap()
   ^
SyntaxError: invalid syntax

Could you please advise me on this?

Thanks in Advance.










Re: Memory needs when using expensive operations like groupBy

2016-04-14 Thread Takeshi Yamamuro
Hi,

You should not use these JVM options directly; use `spark.executor.memory`
and `spark.driver.memory` for this kind of tuning instead.
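
For example, a hedged sketch of where those two settings go (the sizes are
placeholders; the same values can be given to spark-submit as
--executor-memory / --driver-memory or --conf options):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder sizes. Note that spark.driver.memory only takes effect if it is
// set before the driver JVM starts, so it is usually passed to spark-submit.
val conf = new SparkConf()
  .setAppName("groupBy-job")
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "2g")
  .set("spark.yarn.executor.memoryOverhead", "1024")

val sc = new SparkContext(conf)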

// maropu

On Thu, Apr 14, 2016 at 11:32 AM, Divya Gehlot 
wrote:

> Hi,
> I am using Spark 1.5.2 with Scala 2.10 and my Spark job keeps failing with
> exit code 143 .
> except one job where I am using unionAll and groupBy operation on multiple
> columns .
>
> Please advice me the options to optimize it .
> The one option which I am using it now
> --conf spark.executor.extraJavaOptions  -XX:MaxPermSize=1024m
> -XX:PermSize=256m --conf spark.driver.extraJavaOptions
>  -XX:MaxPermSize=1024m -XX:PermSize=256m --conf
> spark.yarn.executor.memoryOverhead=1024
>
> Need to know the best practices/better ways to optimize code.
>
> Thanks,
> Divya
>
>


-- 
---
Takeshi Yamamuro


executor running time vs getting result from jupyter notebook

2016-04-14 Thread Patcharee Thongtra

Hi,

I am running a Jupyter notebook (PySpark). I noticed from the history 
server UI that there are some tasks spending a lot of time on either

- executor running time
- getting result

but some tasks finish both steps very quickly. All tasks, however, have 
very similar input sizes.


What could be the factors behind the time spent on these steps?

BR,
Patcharee




Re: Spark 1.6.0 - token renew failure

2016-04-14 Thread Marcelo Vanzin
You can set "spark.yarn.security.tokens.hive.enabled=false" in your
config, although your app won't work if you actually need Hive
delegation tokens.
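
For example (a sketch; whether setting it programmatically is early enough
depends on how the application is launched, so passing it at submit time with
--conf spark.yarn.security.tokens.hive.enabled=false is the safer bet):

import org.apache.spark.SparkConf

// Skip obtaining Hive metastore delegation tokens for jobs that do not need Hive.
val conf = new SparkConf()
  .set("spark.yarn.security.tokens.hive.enabled", "false")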

On Thu, Apr 14, 2016 at 12:21 AM, Luca Rea
 wrote:
> Hi Jeff,
>
>
>
> Thank you for your support, I’ve removed both the parameters
> (principal/keytab) form spark-defaults.conf, now the command returns an
> error that apparently seems to be related to the issue discussed in the jira
> ticket [SPARK-13478] (https://issues.apache.org/jira/browse/SPARK-13478) and
> fixed in version 2.0.0 (I’m using Spark 1.6.0), can you confirm my
> supposition please?
>
>
>
>
>
>
>
> Log stack:
>
>
>
> 16/04/14 09:07:06 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to
> load native-hadoop library for your platform... using builtin-java classes
> where applicable
>
> 16/04/14 09:07:08 INFO
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl: Timeline service
> address: http://pg-master04.contactlab.lan:8188/ws/v1/timeline/
>
> 16/04/14 09:07:08 WARN
> org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory: The short-circuit
> local reads feature cannot be used because libhadoop cannot be loaded.
>
> 16/04/14 09:07:09 INFO org.apache.hadoop.hdfs.DFSClient: Created
> HDFS_DELEGATION_TOKEN token 2136479 for luca.rea on ha-hdfs:pgha
>
> 16/04/14 09:07:09 INFO hive.metastore: Trying to connect to metastore with
> URI thrift://pg-master05.contactlab.lan:9083
>
> 16/04/14 09:07:09 ERROR org.apache.thrift.transport.TSaslTransport: SASL
> negotiation failure
>
> javax.security.sasl.SaslException: GSS initiate failed [Caused by
> GSSException: No valid credentials provided (Mechanism level: Failed to find
> any Kerberos tgt)]
>
> at
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>
> at
> org.apache.thrift.transport.TSaslClientTransport.handleSaslStartMessage(TSaslClientTransport.java:94)
>
> at
> org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:271)
>
> at
> org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
>
> at
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
> at
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at javax.security.auth.Subject.doAs(Subject.java:415)
>
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>
> at
> org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.open(HiveMetaStoreClient.java:420)
>
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.(HiveMetaStoreClient.java:236)
>
> at
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.(SessionHiveMetaStoreClient.java:74)
>
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
> Method)
>
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.(RetryingMetaStoreClient.java:86)
>
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:132)
>
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:104)
>
> at
> org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3005)
>
>at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3024)
>
> at
> org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1234)
>
> at
> org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:174)
>
> at org.apache.hadoop.hive.ql.metadata.Hive.(Hive.java:166)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokenForHiveMetastoreInner(YarnSparkHadoopUtil.scala:204)
>
> at
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokenForHiveMetastore(YarnSparkHadoopUtil.scala:159)
>
> at
> 

Spark streaming time displayed is not current system time but it is processing current messages

2016-04-14 Thread Hemalatha A
Hi,

I am facing a problem in Spark Streaming. The time displayed in the Spark
Streaming console is 4 days earlier, i.e. April 10th, which is not the current
system time of the cluster, yet the job is processing the current messages
being pushed right now, on April 14th.

Can anyone please advise what time Spark Streaming displays? Also, when there
is a scheduling delay of, say, 8 hours, what time does Spark display: the
current time or hours behind?

-- 


Regards
Hemalatha