Re: Weight column values not used in Binary Logistic Regression Summary

2017-12-09 Thread Sea aj
Hello everyone,

I have a data frame which has two columns: ids and features.

Each cell in the features column is an array of Vectors.dense type, like:

[(DenseVector([0.5692]),), (DenseVector([0.5086]),)]


I need to train a new model for every single row of my data frame. How can
I do it?
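
One hedged sketch of a common pattern (my own illustration, not from this
thread): since Spark ML estimators fit distributed data, a per-row model
usually means fitting a small local model inside a map over the rows. The
column names and the local "model" below are illustrative assumptions.

import numpy as np

def fit_local_model(row):
    # Stack the row's DenseVectors into a local numpy matrix
    # (each cell element looks like (DenseVector([...]),)).
    X = np.array([v[0].toArray() for v in row.features])
    # Fit any cheap local model here; column means stand in for a real fit.
    return (row.ids, X.mean(axis=0))

models = df.rdd.map(fit_local_model).collect()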






On Sat, Nov 18, 2017 at 9:53 AM, Stephen Boesch  wrote:

> In BinaryLogisticRegressionSummary there are @Since("1.5.0") tags on a
> number of comments identical to the following:
>
> * @note This ignores instance weights (setting all to 1.0) from 
> `LogisticRegression.weightCol`.
> * This will change in later Spark versions.
>
>
> Are there any plans to address this? Our team is using instance weights
> with sklearn LogisticRegression - and this limitation will complicate a
> potential migration.
>
>  https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L1543
>
>
>


Re: Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Thanks for the reply.

As far as I understand, mini-batch training is not yet supported in the ML
library. As for MLlib mini-batch SGD, I could not find any PySpark API for it.
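
For what it's worth, the RDD-based MLlib API in PySpark does expose a
miniBatchFraction parameter on its SGD trainers; a minimal sketch under that
assumption (toy data, `sc` is an existing SparkContext):

from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

# Each SGD iteration samples roughly 10% of the data as a mini batch.
points = sc.parallelize([LabeledPoint(1.0, [1.0, 2.0]),
                         LabeledPoint(0.0, [2.0, 1.0])])
model = LinearRegressionWithSGD.train(points, iterations=50,
                                      miniBatchFraction=0.1)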




On Wed, Aug 23, 2017 at 2:59 PM, Suzen, Mehmet <su...@acm.org> wrote:

> It depends on what model you would like to train but models requiring
> optimisation could use SGD with mini batches. See:
> https://spark.apache.org/docs/latest/mllib-optimization.
> html#stochastic-gradient-descent-sgd
>
> On 23 August 2017 at 14:27, Sea aj <saj3...@gmail.com> wrote:
>
>> Hi,
>>
>> I am trying to feed a huge dataframe to an ml algorithm in Spark, but it
>> crashes due to a shortage of memory.
>>
>> Is there a way to train the model on a subset of the data in multiple
>> steps?
>>
>> Thanks
>>
>>
>>
>>
>
>
>
> --
>
> Mehmet Süzen, MSc, PhD
> <su...@acm.org>
>
>


Training A ML Model on a Huge Dataframe

2017-08-23 Thread Sea aj
Hi,

I am trying to feed a huge dataframe to an ml algorithm in Spark, but it
crashes due to a shortage of memory.

Is there a way to train the model on a subset of the data in multiple steps?

Thanks





Re: UI for spark machine learning.

2017-08-22 Thread Sea aj
Jorn,

My question is not about the model type but about Spark's capability to
reuse an already trained ML model when training a new model.




On Tue, Aug 22, 2017 at 1:13 PM, Jörn Franke <jornfra...@gmail.com> wrote:

> Is it really required to have one billion samples for just linear
> regression? Probably your model would do equally well with far fewer
> samples. Have you checked bias and variance if you use far fewer random
> samples?
>
> On 22. Aug 2017, at 12:58, Sea aj <saj3...@gmail.com> wrote:
>
> I have a large dataframe of 1 billion rows of type LabeledPoint. I tried
> to train a linear regression model on the df but it failed due to lack of
> memory although I'm using 9 slaves, each with 100gb of ram and 16 cores of
> CPU.
>
> I decided to split my data into multiple chunks and train the model in
> multiple phases, but I learned the linear regression model in the ml library
> does not have a "setInitialModel" function to pass the trained
> model from one chunk to the rest of the chunks. In other words, each time I
> call the fit function on a chunk of my data, it overwrites the previous
> model.
>
> So far the only solution I have found is using Spark Streaming to be able to
> split the data into multiple dfs and then train over each individually to
> overcome the memory issue.
>
> Do you know if there's any other solution?
>
>
>
>
> On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar <jayantbaya...@gmail.com>
> wrote:
>
>> Hello Mahesh,
>>
>> We have built one. You can download from here :
>> https://www.sparkflows.io/download
>>
>> Feel free to ping me for any questions, etc.
>>
>> Best Regards,
>> Jayant
>>
>>
>> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
>> mahesh_sawai...@persistent.com> wrote:
>>
>>> Hi,
>>>
>>>
>>> 1) Is anyone aware of any workbench kind of tool to run ML jobs in
>>> spark? Specifically, the tool could be something like a web application
>>> that is configured to connect to a spark cluster.
>>>
>>>
>>> The user would be able to select input training sets, probably from hdfs, train,
>>> and then run predictions, without having to write any Scala code.
>>>
>>>
>>> 2) If there is no such tool, is there value in having one, and what could
>>> be the challenges?
>>>
>>>
>>> Thanks,
>>>
>>> Mahesh
>>>
>>>
>>>
>>
>>
>


Re: UI for spark machine learning.

2017-08-22 Thread Sea aj
I have a large dataframe of 1 billion rows of type LabeledPoint. I tried to
train a linear regression model on the df, but it failed due to lack of
memory, although I'm using 9 slaves, each with 100gb of ram and 16 cores of
CPU.

I decided to split my data into multiple chunks and train the model in
multiple phases, but I learned the linear regression model in the ml library
does not have a "setInitialModel" function to pass the trained
model from one chunk to the rest of the chunks. In other words, each time I
call the fit function on a chunk of my data, it overwrites the previous
model.

So far the only solution I have found is using Spark Streaming to be able to
split the data into multiple dfs and then train over each individually to
overcome the memory issue.

Do you know if there's any other solution?
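
A hedged workaround sketch (an assumption, not something confirmed in this
thread): the RDD-based MLlib trainer LinearRegressionWithSGD accepts
initialWeights, so each chunk can continue from the weights learned on the
previous one. `labeled_rdd` is an assumed RDD of LabeledPoint.

from pyspark.mllib.regression import LinearRegressionWithSGD

weights = None
for chunk in labeled_rdd.randomSplit([0.25, 0.25, 0.25, 0.25]):
    # Warm-start each chunk from the weights of the previous chunk;
    # the first chunk starts from the default (zero) initialization.
    model = LinearRegressionWithSGD.train(chunk, iterations=20,
                                          initialWeights=weights)
    weights = model.weights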




On Mon, Jul 10, 2017 at 7:57 AM, Jayant Shekhar 
wrote:

> Hello Mahesh,
>
> We have built one. You can download from here : https://www.sparkflows.io/
> download
>
> Feel free to ping me for any questions, etc.
>
> Best Regards,
> Jayant
>
>
> On Sun, Jul 9, 2017 at 9:35 PM, Mahesh Sawaiker <
> mahesh_sawai...@persistent.com> wrote:
>
>> Hi,
>>
>>
>> 1) Is anyone aware of any workbench kind of tool to run ML jobs in spark?
>> Specifically, the tool could be something like a web application that is
>> configured to connect to a spark cluster.
>>
>>
>> The user would be able to select input training sets, probably from hdfs, train, and
>> then run predictions, without having to write any Scala code.
>>
>>
>> 2) If there is no such tool, is there value in having one, and what could
>> be the challenges?
>>
>>
>> Thanks,
>>
>> Mahesh
>>
>>
>>
>
>


Re: SPARK Issue in Standalone cluster

2017-08-22 Thread Sea aj
Hi everyone,

I have a huge dataframe with 1 billion rows, and each row is a nested list.
That being said, I want to train some ML models on this df, but due to the
huge size, I get an out-of-memory error on one of my nodes when I run the fit
function.

Currently, my configuration is:
144 cores, 16 cores for each of the 8 nodes.
100gb of ram for each slave and 100gb of ram for the driver. I set the
maxResultSize to 20gb.

Do you have any suggestion so far?

I can think of splitting the data into multiple dataframes and then training
the model on each individually, but besides the longer runtime, I learned
that the fit function overwrites the previous model each time I call it. Isn't
there a way to get the fit function to train the new model starting from
the previously trained model?

Thanks





On Sun, Aug 6, 2017 at 11:04 PM, Gourav Sengupta 
wrote:

> Hi Marco,
>
> thanks a ton, I will surely use those alternatives.
>
>
> Regards,
> Gourav Sengupta
>
> On Sun, Aug 6, 2017 at 3:45 PM, Marco Mistroni 
> wrote:
>
>> Sengupta
>>  further to this, if you try the following notebook in databricks cloud,
>> it will read a .csv file, write to a parquet file and read it again (just
>> to count the number of rows stored).
>> Please note that the path to the csv file might differ for you.
>> So, what you will need to do is:
>> 1 - create an account at community.cloud.databricks.com
>> 2 - upload the .csv file onto the Data section of your databricks private cluster
>> 3 - run the script, which will store the data on the distributed
>> filesystem of the databricks cloud (dbfs)
>>
>> It's worth investing in this free databricks cloud as it can create a
>> cluster for you with minimal effort, and it's a very easy way to test your
>> spark scripts on a real cluster
>>
>> hope this helps
>> kr
>>
>> ##
>> from pyspark.sql import SQLContext
>> from pyspark.sql.session import SparkSession
>> from random import randint
>> from time import sleep
>> import logging
>>
>> logger = logging.getLogger(__name__)
>> logger.setLevel(logging.INFO)
>> ch = logging.StreamHandler()
>> logger.addHandler(ch)
>>
>>
>> def read_parquet_file(sqlContext, parquetFileName):
>>     logger.info('Reading now the parquet files we just created...:%s',
>>                 parquetFileName)
>>     parquet_data = sqlContext.read.parquet(parquetFileName)
>>     logger.info('Parquet file has %s', parquet_data.count())
>>
>>
>> def dataprocessing(filePath, count, sqlContext):
>>     logger.info('Iter count is:%s', count)
>>     if count == 0:
>>         logger.info('exiting')
>>     else:
>>         df_traffic_tmp = (sqlContext.read.format("csv")
>>                           .option("header", 'true').load(filePath))
>>         logger.info('#DataSet has:%s', df_traffic_tmp.count())
>>         logger.info('Writing to a parquet file')
>>         parquetFileName = "dbfs:/myParquetDf2.parquet"
>>         df_traffic_tmp.write.parquet(parquetFileName)
>>         sleepInterval = randint(10, 100)
>>         logger.info('#Sleeping for %s', sleepInterval)
>>         sleep(sleepInterval)
>>         read_parquet_file(sqlContext, parquetFileName)
>>         dataprocessing(filePath, count - 1, sqlContext)
>>
>>
>> # This path might differ for you
>> filename = '/FileStore/tables/wb4y1wrv1502027870004/tree_addhealth.csv'
>> iterations = 1
>> logger.info('--')
>> logger.info('Filename:%s', filename)
>> logger.info('Iterations:%s', iterations)
>> logger.info('--')
>>
>> logger.info('Starting spark..Loading from %s for %s iterations',
>>             filename, iterations)
>> spark = SparkSession.builder.appName("Data Processing").getOrCreate()
>> logger.info('Initializing sqlContext')
>> sqlContext = SQLContext(spark.sparkContext)
>> dataprocessing(filename, iterations, sqlContext)
>> logger.info('Out of here..')
>> ##
>>
>>
>> On Sat, Aug 5, 2017 at 9:09 PM, Marco Mistroni 
>> wrote:
>>
>>> Uh believe me there are lots of ppl on this list who will send u code
>>> snippets if u ask... 
>>>
>>> Yes that is what Steve pointed out, suggesting also that for that simple
>>> exercise you should perform all operations on a spark standalone instead
>>> (or alt. Use an nfs on the cluster)
>>> I'd agree with his suggestion
>>> I suggest u another alternative:
>>> https://community.cloud.databricks.com/
>>>
>>> That's a ready-made cluster and you can run your spark app as well as store
>>> data on the cluster (well I haven't tried myself but I assume it's
>>> possible).   Try that out... I will try ur script there as I have an
>>> account there (though I guess you'll get there before me.)
>>>
>>> Try that out and let me know if u get stuck
>>> Kr
>>>
>>> On Aug 5, 2017 8:40 PM, "Gourav Sengupta" 
>>> wrote:
>>>
 Hi Marco,

 

Reading csv.gz files

2017-07-05 Thread Sea aj
I need to import a set of files with the csv.gz extension into Spark. Each file
contains a table of data. I was wondering if anyone knows how to read them?
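
A hedged sketch: Spark's text-based readers decompress gzip transparently, so
csv.gz files load like plain CSVs. The Spark 2.x DataFrame reader is assumed
here; the path is a placeholder.

df = spark.read.csv("/data/*.csv.gz", header=True, inferSchema=True)
df.show(5)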





How does Spark deal with Data Skewness?

2017-06-22 Thread Sea aj
Hi everyone,

I have read about some interesting ideas on how to manage skew, but I was
not sure whether any of these techniques are being used in Spark 2.x versions.
To name a few, "Salting the Data" and "Dynamic Repartitioning" are
techniques introduced in Spark Summit talks. I am really curious to know whether
Spark takes care of skew at all or not.
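
As background, a minimal sketch of the "salting" idea mentioned above (my own
illustration; a df with columns key and value is assumed): spread each hot key
over N sub-keys, aggregate, then fold the partial results back together.

import pyspark.sql.functions as F

N = 16
salted = (df.withColumn("salt", (F.rand() * N).cast("int"))
            .groupBy("key", "salt")
            .agg(F.sum("value").alias("partial")))   # first, narrow aggregation
result = salted.groupBy("key").agg(F.sum("partial").alias("total"))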







Re: SparkSQL not able to read an empty table location

2017-05-21 Thread Sea
Please try setting spark.sql.hive.verifyPartitionPath to true.
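
A hedged sketch of where that flag can be set (Spark 2.x session API assumed;
it can equally go in spark-defaults.conf):

# set at runtime on an existing SparkSession
spark.conf.set("spark.sql.hive.verifyPartitionPath", "true")
# or at submission time:
#   spark-submit --conf spark.sql.hive.verifyPartitionPath=true ...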


------------------ Original Message ------------------
From: "Steve Loughran"
Date: Sat, May 20, 2017, 09:19 PM
To: "Bajpai, Amit X. -ND"
Cc: "user@spark.apache.org"
Subject: Re: SparkSQL not able to read an empty table location



 
On 20 May 2017, at 01:44, Bajpai, Amit X. -ND wrote:

Hi,

I have a hive external table with the S3 location having no files (but the S3
location directory does exist). When I am trying to use Spark SQL to count the
number of records in the table, it throws an error saying "File
s3n://data/xyz does not exist. null/0".

select * from tablex limit 10

Can someone let me know how we can fix this issue.

Thanks


There isn't really a "directory" in S3, just a set of objects whose paths
begin with a string. Try creating an empty file with an _ prefix in the
directory; it should be ignored by Spark SQL but will cause the "directory" to
come into being.

Re: How to specify file

2016-09-23 Thread Sea
Hi, Hemant, Aditya:
I don't want to create a temp table and write code; I just want to run sql
directly on files: "select * from csv.`/path/to/file`"





------------------ Original Message ------------------
From: "Hemant Bhanawat" <hemant9...@gmail.com>
Date: Fri, Sep 23, 2016, 3:32 PM
To: "Sea" <261810...@qq.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: How to specify file




Check out the README on the following page. This is the csv connector that you
are using. I think you need to specify the delimiter option.

https://github.com/databricks/spark-csv

Hemant Bhanawat

www.snappydata.io 







 
On Fri, Sep 23, 2016 at 12:26 PM, Sea <261810...@qq.com> wrote:
Hi, I want to run sql directly on files. I find that spark supports sql
like select * from csv.`/path/to/file`, but my files may not be delimited by ','.
Maybe they are delimited by '\001'; how can I specify the delimiter?

Thank you!
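
A hedged sketch of the delimiter option from the spark-csv README mentioned
above (the path is a placeholder):

# '\u0001' is the '\001' separator from the question
df = (sqlContext.read.format("com.databricks.spark.csv")
      .option("delimiter", "\u0001")
      .load("/path/to/file"))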

How to specify file

2016-09-23 Thread Sea
Hi, I want to run sql directly on files. I find that spark supports sql
like select * from csv.`/path/to/file`, but my files may not be delimited by ','.
Maybe they are delimited by '\001'; how can I specify the delimiter?

Thank you!

Re: Spark hangs at "Removed broadcast_*"

2016-07-12 Thread Sea
please provide your jstack info.




------------------ Original Message ------------------
From: "dhruve ashar"
Date: Wed, Jul 13, 2016, 3:53
To: "Anton Sviridov"
Cc: "user"
Subject: Re: Spark hangs at "Removed broadcast_*"



Looking at the jstack, it seems that it doesn't contain all the threads. Cannot 
find the main thread in the jstack.

I am not an expert on analyzing jstacks, but are you creating any threads in 
your code? Shutting them down correctly?


This one is a non-daemon and doesn't seem to be coming from Spark. 
"Scheduler-2144644334" #110 prio=5 os_prio=0 tid=0x7f8104001800 nid=0x715 
waiting on condition [0x7f812cf95000]



Also, does the shutdown hook get called? 




On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov  wrote:
Hi.

Here's the last few lines before it starts removing broadcasts:


16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_20160723_0009_m_003209_20886' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_20160723_0009_m_003209
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_20160723_0009_m_003209_20886: Committed
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 
20888) in 95 ms on localhost (3209/3214)
16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 
20886) in 103 ms on localhost (3210/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_20160723_0009_m_003208_20885' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_20160723_0009_m_003208
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_20160723_0009_m_003208_20885: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 
20885) in 109 ms on localhost (3211/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_20160723_0009_m_003212_20889' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_20160723_0009_m_003212
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_20160723_0009_m_003212_20889: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 
20889) in 84 ms on localhost (3212/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_20160723_0009_m_003210_20887' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_20160723_0009_m_003210
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_20160723_0009_m_003210_20887: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 
20887) in 100 ms on localhost (3213/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 1
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_20160723_0009_m_003213_20890' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_20160723_0009_m_003213
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_20160723_0009_m_003213_20890: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 
20890) in 82 ms on localhost (3214/3214)
16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have 
all completed, from pool
16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at 
SfCountsDumper.scala:13) finished in 42.294 s
16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at 
SfCountsDumper.scala:13, took 9517.124624 s
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
16/07/11 14:28:46 INFO BlockManager: 

Re: Bug about reading parquet files

2016-07-08 Thread Sea
.collection.mutable.GrowingBuilder.$plus$eq(GrowingBuilder.scala:24)
  at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
  at 
scala.collection.generic.Growable$$anonfun$$plus$plus$eq$1.apply(Growable.scala:48)
  at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at 
scala.collection.mutable.GrowingBuilder.$plus$plus$eq(GrowingBuilder.scala:24)
  at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:48)
  at 
org.apache.spark.sql.sources.HadoopFsRelation$.listLeafFilesInParallel(interfaces.scala:910)
  at 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.listLeafFiles(interfaces.scala:445)
  at 
org.apache.spark.sql.sources.HadoopFsRelation$FileStatusCache.refresh(interfaces.scala:477)
  at 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache$lzycompute(interfaces.scala:489)
  at 
org.apache.spark.sql.sources.HadoopFsRelation.org$apache$spark$sql$sources$HadoopFsRelation$$fileStatusCache(interfaces.scala:487)
  at 
org.apache.spark.sql.sources.HadoopFsRelation.cachedLeafStatuses(interfaces.scala:494)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$MetadataCache.refresh(ParquetRelation.scala:398)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache$lzycompute(ParquetRelation.scala:145)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.org$apache$spark$sql$execution$datasources$parquet$ParquetRelation$$metadataCache(ParquetRelation.scala:143)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anonfun$6.apply(ParquetRelation.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
  at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:202)
  at 
org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:636)
  at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:635)
  at 
org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)





------------------ Original Message ------------------
From: "lian.cs.zju" <lian.cs@gmail.com>
Date: Fri, Jul 8, 2016, 4:47 PM
To: "Sea" <261810...@qq.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Bug about reading parquet files



What's the Spark version? Could you please also attach the result of
explain(extended = true)?


On Fri, Jul 8, 2016 at 4:33 PM, Sea <261810...@qq.com> wrote:
I have a problem reading parquet files.
sql:
select count(1) from   omega.dwd_native where year='2016' and month='07' and 
day='05' and hour='12' and appid='6';
The hive partition is (year,month,day,appid)


only two tasks, and it will list all directories in my table, not only 
/user/omega/events/v4/h/2016/07/07/12/appid=6
[Stage 1:>  (0 + 0) / 2]


16/07/08 16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/05/31/21/appid=1 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/06/28/20/appid=2 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/07/22/21/appid=65537 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/08/14/05/appid=65536

Bug about reading parquet files

2016-07-08 Thread Sea
I have a problem reading parquet files.
sql:
select count(1) from   omega.dwd_native where year='2016' and month='07' and 
day='05' and hour='12' and appid='6';
The hive partition is (year,month,day,appid)


only two tasks, and it will list all directories in my table, not only 
/user/omega/events/v4/h/2016/07/07/12/appid=6
[Stage 1:>  (0 + 0) / 2]


16/07/08 16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/05/31/21/appid=1 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/06/28/20/appid=2 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/07/22/21/appid=65537 16/07/08 
16:16:51 INFO sources.HadoopFsRelation: Listing 
hdfs://mycluster-tj/user/omega/events/v4/h/2016/08/14/05/appid=65536

Re: G1 GC takes too much time

2016-05-29 Thread Sea
Yes, It seems like that CMS is better. I have tried G1 as databricks' blog 
recommended, but it's too slow.
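
For reference, a hedged sketch of what a CMS-based executor setting might look
like (standard HotSpot flags; the occupancy threshold is a placeholder, tune
for your heap):

spark.executor.extraJavaOptions -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly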




------------------ Original Message ------------------
From: "condor join"
Date: Mon, May 30, 2016, 10:17
To: "Ted Yu"
Cc: "user@spark.apache.org"
Subject: Re: G1 GC takes too much time



The following are the parameters:
-XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark
-XX:InitiatingHeapOccupancyPercent=35

spark.executor.memory=4G
 
 
 
 
From: Ted Yu
Sent: Mon, May 30, 2016, 9:47:05
To: condor join
Cc: user@spark.apache.org
Subject: Re: G1 GC takes too much time
 
  bq. It happens during the Reduce majority. 
 
 Did the above refer to reduce operation ?
 
 
 Can you share your G1GC parameters (and heap size for workers) ?
 
 
 Thanks
 
 
 On Sun, May 29, 2016 at 6:15 PM, condor join   wrote:
Hi, my spark application failed due to taking too much time during GC.
Looking at the logs I found these things:
1. There are Young GCs that take too much time, and no Full GC was found to happen then;
2. Too much time is taken during the object copy;
3. It happened more easily when there were not enough resources;
4. It happens mostly during the Reduce phase.


Has anyone met the same question?
thanks
 
 
 
 
 
 

Re: spark sql on hive

2016-04-18 Thread Sea
It's a bug of hive. Please use the hive metastore service instead of visiting
mysql directly.
Set hive.metastore.uris in hive-site.xml.
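
A hedged sketch of the hive-site.xml entry in question (host and port are
placeholders for your metastore service):

<property>
  <!-- point Hive clients at the metastore service, not the MySQL backend -->
  <name>hive.metastore.uris</name>
  <value>thrift://metastore-host:9083</value>
</property>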






------------------ Original Message ------------------
From: "Jieliang Li"
Date: Tue, Apr 19, 2016, 12:55
To: "user"
Subject: spark sql on hive



Hi everyone. I use spark sql, but it throws an exception:
Retrying creating default database after error: Error creating transactional 
connection factory
javax.jdo.JDOFatalInternalException: Error creating transactional connection 
factory
at 
org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:587)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:788)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
at 
org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
at java.security.AccessController.doPrivileged(Native Method)
at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
at 
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at 
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at 
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at 
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:56)
at 
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:65)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:579)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:557)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:606)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:448)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5601)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:193)
at 
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at 
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1486)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:64)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:74)
at 
org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2841)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2860)
at 
org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:453)
at 
org.apache.spark.sql.hive.HiveContext.sessionState$lzycompute(HiveContext.scala:229)
at 
org.apache.spark.sql.hive.HiveContext.sessionState(HiveContext.scala:225)
at 
org.apache.spark.sql.hive.HiveContext$QueryExecution.<init>(HiveContext.scala:373)
at 
org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:80)
at 
org.apache.spark.sql.hive.HiveContext.executePlan(HiveContext.scala:49)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 

Re: Limit pyspark.daemon threads

2016-03-18 Thread Sea
It's useless...  The python worker will go above 1.5g in my production 
environment




------------------ Original Message ------------------
From: "Ted Yu"
Date: Thu, Mar 17, 2016, 10:50
To: "Carlile, Ken"
Cc: "user"
Subject: Re: Limit pyspark.daemon threads



I took a look at docs/configuration.md.
Though I didn't find an answer for your first question, I think the following
pertains to your second question:

spark.python.worker.memory (default: 512m)
    Amount of memory to use per python worker process during aggregation, in
    the same format as JVM memory strings (e.g. 512m, 2g). If the memory
    used during aggregation goes above this amount, it will spill the data
    into disks.
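
A hedged illustration of the knob as it would appear in spark-defaults.conf
(the value is a placeholder; note this caps per-worker aggregation memory
before spilling, not the number of pyspark.daemon workers, which generally
tracks the number of concurrently running tasks):

spark.python.worker.memory   512m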




On Thu, Mar 17, 2016 at 7:43 AM, Carlile, Ken  wrote:
Hello,
 
We have an HPC cluster that we run Spark jobs on using standalone mode and a
number of scripts I've built up to dynamically schedule and start spark
clusters within the Grid Engine framework. Nodes in the cluster have 16 cores
and 128GB of RAM.

My users use pyspark heavily. We've been having a number of problems with
nodes going offline with extraordinarily high load. I was able to look at one
of those nodes today before it went truly sideways, and I discovered that the
user was running 50 pyspark.daemon threads (remember, this is a 16 core box),
and the load was somewhere around 25 or so, with all CPUs maxed out at 100%.

So while the spark worker is aware it's only got 16 cores and behaves
accordingly, pyspark seems to be happy to overrun everything like crazy. Is
there a global parameter I can use to limit pyspark threads to a sane number,
say 15 or 16? It would also be interesting to set a memory limit, which leads
to another question.

How is memory managed when pyspark is used? I have the spark worker memory set
to 90GB, and there is 8GB of system overhead (GPFS caching), so if pyspark
operates outside of the JVM memory pool, that leaves it at most 30GB to play
with, assuming there is no overhead outside the JVM's 90GB heap (ha ha.)
 
 Thanks,
 Ken Carlile
 Sr. Unix Engineer
 HHMI/Janelia Research Campus
 571-209-4363

Re: mapwithstate Hangs with Error cleaning broadcast

2016-03-15 Thread Sea
Hi, Manas:
Maybe you can look at this bug:
https://issues.apache.org/jira/browse/SPARK-13566






------------------ Original Message ------------------
From: "manas kar"
Date: Tue, Mar 15, 2016, 10:48
To: "Ted Yu"
Cc: "user"
Subject: Re: mapwithstate Hangs with Error cleaning broadcast



I am using spark 1.6. I am not using any broadcast variable.
This broadcast variable is probably used by the state management of mapWithState.


...Manas


On Tue, Mar 15, 2016 at 10:40 AM, Ted Yu  wrote:
Which version of Spark are you using ?

Can you show the code snippet w.r.t. broadcast variable ?


Thanks


On Tue, Mar 15, 2016 at 6:04 AM, manasdebashiskar  wrote:
Hi,
  I have a streaming application that takes data from a kafka topic and uses
 mapwithstate.
  After couple of hours of smooth running of the application I see a problem
 that seems to have stalled my application.
 The batch seems to have been stuck after the following error popped up.
 Has anyone seen this error or know what causes it?
 14/03/2016 21:41:13,295 ERROR org.apache.spark.ContextCleaner: 95 - Error
 cleaning broadcast 7456
 org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120
 seconds]. This timeout is controlled by spark.rpc.askTimeout
 at
 
org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
 at
 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
 at
 
org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
 at
 scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
 at
 
org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
 at
 
org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
 at
 
org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
 at
 
org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
 at
 org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:233)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:189)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:180)
 at scala.Option.foreach(Option.scala:236)
 at
 
org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:180)
 at
 org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
 at
 
org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:173)
 at
 org.apache.spark.ContextCleaner$$anon$3.run(ContextCleaner.scala:68)
 Caused by: java.util.concurrent.TimeoutException: Futures timed out after
 [120 seconds]
 at
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
 at
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
 at
 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:107)
 at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
 ... 12 more
 
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/mapwithstate-Hangs-with-Error-cleaning-broadcast-tp26500.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 

Re: Spark UI standalone "crashes" after an application finishes

2016-02-29 Thread Sea
Hi, Sumona:
It's a bug in old Spark versions; in spark 1.6.0 it is fixed.
After an application completes, the spark master will load the event log into
memory, and it is synchronous because of the actor. If the event log is big,
the spark master will hang a long time, and you cannot submit any
applications; if your master memory is too small, your master will die!
The solution in spark 1.6 is not very good: the operation is async, and
so you still need to set a big java heap for the master.






------------------ Original Message ------------------
From: "Shixiong(Ryan) Zhu"
Date: Tue, Mar 1, 2016, 8:02
To: "Sumona Routh"
Cc: "user@spark.apache.org"
Subject: Re: Spark UI standalone "crashes" after an application finishes



Do you mean you cannot access Master UI after your application completes? Could 
you check the master log?

On Mon, Feb 29, 2016 at 3:48 PM, Sumona Routh  wrote:
Hi there,

I've been doing some performance tuning of our Spark application, which is 
using Spark 1.2.1 standalone. I have been using the spark metrics to graph out 
details as I run the jobs, as well as the UI to review the tasks and stages.


I notice that after my application completes, or is near completion, the UI 
"crashes." I get a Connection Refused response. Sometimes, the page eventually 
recovers and will load again, but sometimes I end up having to restart the 
Spark master to get it back. When I look at my graphs on the app, the memory 
consumption (of driver, executors, and what I believe to be the daemon 
(spark.jvm.total.used)) appears to be healthy. Monitoring the master machine 
itself, memory and CPU appear healthy as well.


Has anyone else seen this issue? Are there logs for the UI itself, and where 
might I find those?


Thanks!

Sumona

Deadlock between UnifiedMemoryManager and BlockManager

2016-02-29 Thread Sea
Hi, all:
My spark version is 1.6.0. I found a deadlock in the production environment;
can anyone help? I created an issue in jira:
https://issues.apache.org/jira/browse/SPARK-13566




===
"block-manager-slave-async-thread-pool-1":
at org.apache.spark.storage.MemoryStore.remove(MemoryStore.scala:216)
- waiting to lock <0x0005895b09b0> (a 
org.apache.spark.memory.UnifiedMemoryManager)
at 
org.apache.spark.storage.BlockManager.removeBlock(BlockManager.scala:1114)
- locked <0x00058ed6aae0> (a org.apache.spark.storage.BlockInfo)
at 
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
at 
org.apache.spark.storage.BlockManager$$anonfun$removeBroadcast$2.apply(BlockManager.scala:1101)
at scala.collection.immutable.Set$Set2.foreach(Set.scala:94)
at 
org.apache.spark.storage.BlockManager.removeBroadcast(BlockManager.scala:1101)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply$mcI$sp(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$receiveAndReply$1$$anonfun$applyOrElse$4.apply(BlockManagerSlaveEndpoint.scala:65)
at 
org.apache.spark.storage.BlockManagerSlaveEndpoint$$anonfun$1.apply(BlockManagerSlaveEndpoint.scala:84)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
"Executor task launch worker-10":
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1032)
- waiting to lock <0x00059a0988b8> (a 
org.apache.spark.storage.BlockInfo)
at 
org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:1009)
at 
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:460)
at 
org.apache.spark.storage.MemoryStore$$anonfun$evictBlocksToFreeSpace$2.apply(MemoryStore.scala:449)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

Re: off-heap certain operations

2016-02-11 Thread Sea
spark.memory.offHeap.enabled defaults to false; it is wrong in the spark docs.
Spark 1.6 does not recommend using off-heap memory.
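
For completeness, a hedged sketch of what enabling it anyway would require in
Spark 1.6 (both flags are needed; the size is a placeholder and must be
greater than 0 when the feature is on):

--conf spark.memory.offHeap.enabled=true
--conf spark.memory.offHeap.size=2g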




------------------ Original Message ------------------
From: "Ovidiu-Cristian MARCU"
Date: Fri, Feb 12, 2016, 5:51
To: "user"
Subject: off-heap certain operations



Hi,

Reading through the latest documentation for memory management, I can see that
the parameter spark.memory.offHeap.enabled (true by default) is described with
"If true, Spark will attempt to use off-heap memory for certain operations"
[1].


Can you please describe the certain operations you are referring to?  


http://spark.apache.org/docs/latest/configuration.html#memory-management


Thanks!


Best,
Ovidiu

Re: Shuffle memory woes

2016-02-07 Thread Sea
Hi, Corey:
"The dataset is 100gb at most, the spills can be up to 10T-100T". Are your
input files in lzo format, and do you use sc.textFile()? If memory is not
enough, spark will spill 3-4x of the input data to disk.




------------------ Original Message ------------------
From: "Corey Nolet"
Date: Sun, Feb 7, 2016, 8:56
To: "Igor Berman"
Cc: "user"
Subject: Re: Shuffle memory woes



As for the second part of your questions: we have a fairly complex join process
which requires a ton of stage orchestration from our driver. I've written some
code to be able to walk down our DAG tree and execute siblings in the tree
concurrently where possible (forcing cache to disk on children that have
multiple children themselves so that they can be run concurrently). Ultimately,
we have seen significant speedup in our jobs by keeping tasks as busy as
possible processing concurrent stages. Funny enough though, the stage that is
causing problems with shuffling for us has a lot of children and doesn't even
run concurrently with any other stages, so I ruled out the concurrency of the
stages as a culprit for the shuffling problem we're seeing.

On Sun, Feb 7, 2016 at 7:49 AM, Corey Nolet  wrote:
Igor,

I don't think the question is "why can't it fit stuff in memory". I know why it
can't fit stuff in memory: because it's a large dataset that needs to have a
reduceByKey() run on it. My understanding is that when it doesn't fit into
memory it needs to spill in order to consolidate intermediary files into a
single file. The more data you need to run through this, the more it will need
to spill. My finding is that once it gets stuck in this spill chain with our
dataset it's all over at that point because it will spill and spill and spill
and spill and spill. If I give the shuffle enough memory it won't, irrespective
of the number of partitions we have (I've done everything from repartition(500)
to repartition(2500)). It's not a matter of running out of memory on a single
node because the data is skewed. It's more a matter of the shuffle buffer
filling up and needing to spill. I think what may be happening is that it gets
to a point where it's spending more time reading/writing from disk while doing
the spills than it is actually processing any data. I can tell this because I
can see that the spills sometimes get up into the 10's to 100's of TB where the
input data was maybe 100gb at most. Unfortunately my code is
on a private internal network and I'm not able to share it.


On Sun, Feb 7, 2016 at 3:38 AM, Igor Berman  wrote:
so can you provide code snippets: especially it's interesting to see what are 
your transformation chain, how many partitions are there on each side of 
shuffle operation

the question is why it can't fit stuff in memory when you are shuffling - maybe 
your partitioner on "reduce" side is not configured properly? I mean if map 
side is ok, and you just reducing by key or something it should be ok, so some 
detail is missing...skewed data? aggregate by key?


On 6 February 2016 at 20:13, Corey Nolet  wrote:
Igor,

Thank you for the response but unfortunately, the problem I'm referring to goes 
beyond this. I have set the shuffle memory fraction to be 90% and set the cache 
memory to be 0. Repartitioning the RDD helped a tad on the map side but didn't 
do much for the spilling when there was no longer any memory left for the 
shuffle. Also the new auto-memory management doesn't seem like it'll have too 
much of an effect after I've already given most of the memory I've allocated to
the shuffle. The problem I'm having is most specifically related to the shuffle
performance declining by several orders of magnitude when it needs to spill
multiple times (it ends up spilling several hundred times for me when it can't
fit stuff into memory).






On Sat, Feb 6, 2016 at 6:40 AM, Igor Berman  wrote:
Hi, usually you can solve this in 2 steps:
make the rdd have more partitions
play with the shuffle memory fraction


in spark 1.6 cache vs shuffle memory fractions are adjusted automatically


On 5 February 2016 at 23:07, Corey Nolet  wrote:
I just recently discovered that my jobs were taking several hours to
complete because of excess shuffle spills. What I found was that when I hit
the high point where I didn't have enough memory for the shuffles to store all
of their file consolidations at once, they could spill so many times that it
caused my job's runtime to increase by orders of magnitude (and sometimes fail
altogether).


I've played with all the tuning parameters I can find. To speed the shuffles 
up, I tuned the akka threads to different values. I also tuned the shuffle 
buffering a tad (both up and down). 


I feel like I see a weak point here. The mappers are sharing memory space with 
reducers 

How to query data in tachyon with spark-sql

2016-01-20 Thread Sea
Hi, all. I want to mount some hive tables in tachyon, but I don't know how to
query data in tachyon with spark-sql. Does anyone know?

Re: How to use Java8

2016-01-05 Thread Sea
thanks




------------------ Original Message ------------------
From: "Andy Davidson" <a...@santacruzintegration.com>
Date: Wed, Jan 6, 2016, 11:04 AM
To: "Sea" <261810...@qq.com>; "user" <user@spark.apache.org>
Subject: Re: How to use Java8





Hi Sea


From:  Sea <261810...@qq.com>
Date:  Tuesday, January 5, 2016 at 6:16 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  How to use Java8



Hi, all
I want to support java8. I use JDK1.8.0_65 in the production environment, but
it doesn't work. Should I build spark using jdk1.8, and set
<java.version>1.8</java.version> in pom.xml?


java.lang.UnsupportedClassVersionError: Unsupported major.minor version 52.0






Here are some notes I wrote about how to configure my data center to use java
8. You'll probably need to do something like this.


Your mileage may vary


Andy



Setting JAVA_HOME

ref: configure env vars

Install java 8 on all nodes (master and slaves).

Install java 1.8 on the master:
    $ ssh -i $KEY_FILE root@$SPARK_MASTER
    # how was this package downloaded from oracle? curl?
    yum install jdk-8u65-linux-x64.rpm

Copy the rpm to the slaves and install java 1.8 on them:
    for i in `cat /root/spark-ec2/slaves`; do scp /home/ec2-user/jdk-8u65-linux-x64.rpm $i:; done
    pssh -i -h /root/spark-ec2/slaves ls -l
    pssh -i -h /root/spark-ec2/slaves yum install -y jdk-8u65-linux-x64.rpm

Remove the rpm from the slaves (it is 153M):
    pssh -i -h /root/spark-ec2/slaves rm jdk-8u65-linux-x64.rpm

Configure spark to use java 1.8

ref: configure env vars

Make a backup of the config file:
    cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date +%Y-%m-%d:%H:%M:%S`
    pssh -i -h /root/spark-ec2/slaves cp /root/spark/conf/spark-env.sh /root/spark/conf/spark-env.sh-`date +%Y-%m-%d:%H:%M:%S`
    pssh -i -h /root/spark-ec2/slaves ls "/root/spark/conf/spark-env.sh*"

Edit /root/spark/conf/spark-env.sh and add:
    export JAVA_HOME=/usr/java/latest

Copy spark-env.sh to the slaves:
    pssh -i -h /root/spark-ec2/slaves grep JAVA_HOME /root/spark/conf/spark-env.sh
    for i in `cat /root/spark-ec2/slaves`; do scp /root/spark/conf/spark-env.sh $i:/root/spark/conf/spark-env.sh; done
    pssh -i -h /root/spark-ec2/slaves grep JAVA_HOME /root/spark/conf/spark-env.sh

How to use Java8

2016-01-05 Thread Sea
Hi, all
I want to support java8. I use JDK1.8.0_65 in the production environment, but
it doesn't work. Should I build spark using jdk1.8, and set
<java.version>1.8</java.version> in pom.xml?


java.lang.UnsupportedClassVersionError: Unsupported major.minor version 52.0
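
A hedged sketch of building with JDK 1.8 on the build machine (the JDK path is
a placeholder; build/mvn is Spark's bundled Maven wrapper):

export JAVA_HOME=/usr/java/jdk1.8.0_65
./build/mvn -DskipTests clean package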

[Spark Streaming] Unable to write checkpoint when restart

2015-11-21 Thread Sea
When I restart my streaming program, this bug shows up, and it will kill my
program.
I am using spark 1.4.1.


15/11/22 03:20:00 WARN CheckpointWriter: Error in attempt 1 of writing 
checkpoint to hdfs://streaming/user/dm/order_predict/streaming_
v2/10/checkpoint/checkpoint-144813360
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
 Lease mismatch on /user/dm/order_
predict/streaming_v2/10/checkpoint/temp owned by 
DFSClient_NONMAPREDUCE_558833758_1 but is accessed by 
DFSClient_NONMAPREDUCE_20734830
69_1
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2752)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFileInternal(FSNamesystem.java:2801)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2783)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:611)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTra
nslatorPB.java:428)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNameno
deProtocolProtos.java:59586)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy7.complete(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.complete(ClientNamenodeProtocolTranslatorPB.java:371)
at sun.reflect.GeneratedMethodAccessor41.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy8.complete(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:1894)
at 
org.apache.hadoop.hdfs.DFSOutputStream.close(DFSOutputStream.java:1881)
at 
org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:71)

Re: About memory leak in spark 1.4.1

2015-08-05 Thread Sea
No one helped me... I helped myself: I split the cluster into two clusters,
1.4.1 and 1.3.0.




------------------ Original Message ------------------
From: Ted Yu <yuzhih...@gmail.com>
Date: Tue, Aug 4, 2015, 10:28
To: Igor Berman <igor.ber...@gmail.com>
Cc: Sea <261810...@qq.com>; Barak Gitsis <bar...@similarweb.com>;
user@spark.apache.org; rxin <r...@databricks.com>;
joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



w.r.t. spark.deploy.spreadOut, here is the scaladoc:

  // As a temporary workaround before better ways of configuring memory, we allow users to set
  // a flag that will perform round-robin scheduling across the nodes (spreading out each app
  // among all the nodes) instead of trying to consolidate each app onto a small # of nodes.
  private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)



Cheers


On Tue, Aug 4, 2015 at 4:13 AM, Igor Berman igor.ber...@gmail.com wrote:
sorry, can't disclose info about my prod cluster
nothing jumps into my mind regarding your config
we don't use lz4 compression; I don't know what spark.deploy.spreadOut is
(there is no documentation regarding this)


If you are sure that you don't have a memory leak in your business logic, I
would try to reset each property to its default (or just remove it from your
config) and try to run your job to see if it's not somehow connected.



my config(nothing special really)
spark.shuffle.consolidateFiles true
spark.speculation false

spark.executor.extraJavaOptions -XX:+UseStringCache -XX:+UseCompressedStrings 
-XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log 
-verbose:gc
spark.executor.logs.rolling.maxRetainedFiles 1000
spark.executor.logs.rolling.strategy time
spark.worker.cleanup.enabled true
spark.logConf true
spark.rdd.compress true











On 4 August 2015 at 12:59, Sea 261810...@qq.com wrote:
How many machines are there in your standalone cluster?

I am not using tachyon.


GC can not help me... Can anyone help ?


my configuration:


spark.deploy.spreadOut false
spark.eventLog.enabled true
spark.executor.cores 24


spark.ui.retainedJobs 10
spark.ui.retainedStages 10
spark.history.retainedApplications 5
spark.deploy.retainedApplications 10
spark.deploy.retainedDrivers  10
spark.streaming.ui.retainedBatches 10
spark.sql.thriftserver.ui.retainedSessions 10
spark.sql.thriftserver.ui.retainedStatements 100



spark.file.transferTo false
spark.driver.maxResultSize 4g
spark.sql.hive.metastore.jars=/spark/spark-1.4.1/hive/*


spark.eventLog.dir hdfs://mycluster/user/spark/historylog
spark.history.fs.logDirectory hdfs://mycluster/user/spark/historylog



spark.driver.extraClassPath=/spark/spark-1.4.1/extlib/*
spark.executor.extraClassPath=/spark/spark-1.4.1/extlib/*



spark.sql.parquet.binaryAsString true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer 32
spark.kryoserializer.buffer.max 256
spark.shuffle.consolidateFiles true
spark.io.compression.codec org.apache.spark.io.LZ4CompressionCodec











------------------ Original Message ------------------
From: Igor Berman <igor.ber...@gmail.com>
Date: Mon, Aug 3, 2015, 7:56
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; Ted Yu <yuzhih...@gmail.com>;
user@spark.apache.org; rxin <r...@databricks.com>;
joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1





in general, what is your configuration? use --conf spark.logConf=true



we have 1.4.1 in a production standalone cluster and haven't experienced what
you are describing... can you verify in the web-ui that spark indeed got your
50g per executor limit? I mean in the configuration page..


might be you are using offheap storage(Tachyon)?




On 3 August 2015 at 04:58, Sea 261810...@qq.com wrote:
"spark uses a lot more than heap memory, it is the expected behavior." It
didn't exist in spark 1.3.x.
What does "a lot more than" mean? It means that I lose control of it!
I tried to apply 31g, but it still grows to 55g and continues to grow!!! That
is the point!
I have tried setting memoryFraction to 0.2, but it didn't help.
I don't know whether it will still exist in the next release 1.5; I wish not.






------------------ Original Message ------------------
From: Barak Gitsis <bar...@similarweb.com>
Date: Sun, Aug 2, 2015, 9:55
To: Sea <261810...@qq.com>; Ted Yu <yuzhih...@gmail.com>
Cc: user@spark.apache.org; rxin <r...@databricks.com>;
joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1





Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3.


Better to use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.

Re: About memory leak in spark 1.4.1

2015-08-04 Thread Sea
How many machines are there in your standalone cluster?

I am not using Tachyon.


GC cannot help me... Can anyone help?


my configuration:


spark.deploy.spreadOut false
spark.eventLog.enabled true
spark.executor.cores 24


spark.ui.retainedJobs 10
spark.ui.retainedStages 10
spark.history.retainedApplications 5
spark.deploy.retainedApplications 10
spark.deploy.retainedDrivers  10
spark.streaming.ui.retainedBatches 10
spark.sql.thriftserver.ui.retainedSessions 10
spark.sql.thriftserver.ui.retainedStatements 100



spark.file.transferTo false
spark.driver.maxResultSize 4g
spark.sql.hive.metastore.jars=/spark/spark-1.4.1/hive/*


spark.eventLog.dir hdfs://mycluster/user/spark/historylog
spark.history.fs.logDirectory hdfs://mycluster/user/spark/historylog



spark.driver.extraClassPath=/spark/spark-1.4.1/extlib/*
spark.executor.extraClassPath=/spark/spark-1.4.1/extlib/*



spark.sql.parquet.binaryAsString true
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer 32
spark.kryoserializer.buffer.max 256
spark.shuffle.consolidateFiles true
spark.io.compression.codec org.apache.spark.io.LZ4CompressionCodec











------ Original message ------
From: Igor Berman <igor.ber...@gmail.com>
Date: Aug 3, 2015, 7:56
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; Ted Yu <yuzhih...@gmail.com>; user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



in general, what is your configuration? use --conf spark.logConf=true



we have 1.4.1 in a production standalone cluster and haven't experienced what you are describing... can you verify in the web UI that Spark indeed got your 50g per-executor limit? I mean on the configuration page.


might you be using off-heap storage (Tachyon)?




On 3 August 2015 at 04:58, Sea 261810...@qq.com wrote:
"spark uses a lot more than heap memory, it is the expected behavior." It didn't exist in spark 1.3.x.
What does "a lot more than" mean? It means that I lose control of it!
I tried to apply 31g, but it still grows to 55g and continues to grow!!! That is the point!
I have tried setting memoryFraction to 0.2, but it didn't help.
I don't know whether it will still exist in the next release, 1.5; I wish not.






------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 9:55
To: Sea <261810...@qq.com>; Ted Yu <yuzhih...@gmail.com>
Cc: user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1





Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3.


Better to use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.
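
As an illustrative sketch of that advice (Spark 1.x semantics, where spark.storage.memoryFraction carves the cache share out of the executor heap; the values here are made up):

import org.apache.spark.{SparkConf, SparkContext}

// Keep the executor heap modest and shrink the cache share, leaving the
// rest of the heap for shuffle/execution and the JVM itself.
val conf = new SparkConf()
  .setAppName("memory-fraction-sketch")
  .set("spark.executor.memory", "8g")          // heap per executor
  .set("spark.storage.memoryFraction", "0.3")  // cache share of the heap (pre-1.6 semantics)
val sc = new SparkContext(conf)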



 






On Sun, Aug 2, 2015 at 12:54 PM Sea 261810...@qq.com wrote:

spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than the heap memory!


Does anyone else use spark 1.4.1 in production?




------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Aug 2, 2015, 5:45
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1




http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:
Hi, Barak
It is ok with spark 1.3.0, the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any sense, because it 
is still in heap memory. 




------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 4:11
To: Sea <261810...@qq.com>; user <user@spark.apache.org>
Cc: rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



Hi, reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved.
My reasoning is:
I give the executor all the memory I can give it, so that makes it a boundary.
From here I try to make the best use of memory I can. storage.memoryFraction is in a sense user data space. The rest can be used by the system.
If you don't have so much data that you MUST store in memory for performance, better to give Spark more space.
I ended up setting it to 0.3.


All that said, it is on spark 1.3 on cluster

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
Hi, Barak
It is OK with spark 1.3.0; the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any difference, because it is still in heap memory.




------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 4:11
To: Sea <261810...@qq.com>; user <user@spark.apache.org>
Cc: rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



Hi, reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved.
My reasoning is:
I give the executor all the memory I can give it, so that makes it a boundary.
From here I try to make the best use of memory I can. storage.memoryFraction is in a sense user data space. The rest can be used by the system.
If you don't have so much data that you MUST store in memory for performance, better to give Spark more space.
I ended up setting it to 0.3.


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the max limit of the server, and the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju  20   0 75.5g  55g  28m S 1729.3 88.1 2172:52 java


55g is more than the 50g I applied for



-- 

-Barak

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than the heap memory!


Does anyone else use spark 1.4.1 in production?




------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Aug 2, 2015, 5:45
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:
Hi, Barak
It is OK with spark 1.3.0; the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any difference, because it is still in heap memory.




------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 4:11
To: Sea <261810...@qq.com>; user <user@spark.apache.org>
Cc: rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



Hi, reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved.
My reasoning is:
I give the executor all the memory I can give it, so that makes it a boundary.
From here I try to make the best use of memory I can. storage.memoryFraction is in a sense user data space. The rest can be used by the system.
If you don't have so much data that you MUST store in memory for performance, better to give Spark more space.
I ended up setting it to 0.3.


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the max limit of the server, and the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju  20   0 75.5g  55g  28m S 1729.3 88.1 2172:52 java


55g is more than the 50g I applied for



-- 

-Barak

Re: About memory leak in spark 1.4.1

2015-08-02 Thread Sea
"spark uses a lot more than heap memory, it is the expected behavior." It didn't exist in spark 1.3.x.
What does "a lot more than" mean? It means that I lose control of it!
I tried to apply 31g, but it still grows to 55g and continues to grow!!! That is the point!
I have tried setting memoryFraction to 0.2, but it didn't help.
I don't know whether it will still exist in the next release, 1.5; I wish not.






------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 9:55
To: Sea <261810...@qq.com>; Ted Yu <yuzhih...@gmail.com>
Cc: user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



Spark uses a lot more than heap memory; it is the expected behavior. In 1.4, off-heap memory usage is supposed to grow in comparison to 1.3.


Better to use as little memory as you can for heap, and since you are not utilizing it already, it is safe for you to reduce it.
memoryFraction helps you optimize heap usage for your data/application profile while keeping it tight.



 






On Sun, Aug 2, 2015 at 12:54 PM Sea 261810...@qq.com wrote:

spark.storage.memoryFraction is in heap memory, but my situation is that the memory used is more than the heap memory!


Does anyone else use spark 1.4.1 in production?




------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Aug 2, 2015, 5:45
To: Sea <261810...@qq.com>
Cc: Barak Gitsis <bar...@similarweb.com>; user <user@spark.apache.org>; rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1




http://spark.apache.org/docs/latest/tuning.html does mention 
spark.storage.memoryFraction in two places.
One is under Cache Size Tuning section.


FYI


On Sun, Aug 2, 2015 at 2:16 AM, Sea 261810...@qq.com wrote:
Hi, Barak
It is OK with spark 1.3.0; the problem is with spark 1.4.1.
I don't think spark.storage.memoryFraction will make any difference, because it is still in heap memory.




------ Original message ------
From: Barak Gitsis <bar...@similarweb.com>
Date: Aug 2, 2015, 4:11
To: Sea <261810...@qq.com>; user <user@spark.apache.org>
Cc: rxin <r...@databricks.com>; joshrosen <joshro...@databricks.com>; davies <dav...@databricks.com>
Subject: Re: About memory leak in spark 1.4.1



Hi, reducing spark.storage.memoryFraction did the trick for me. Heap doesn't get filled because it is reserved.
My reasoning is:
I give the executor all the memory I can give it, so that makes it a boundary.
From here I try to make the best use of memory I can. storage.memoryFraction is in a sense user data space. The rest can be used by the system.
If you don't have so much data that you MUST store in memory for performance, better to give Spark more space.
I ended up setting it to 0.3.


All that said, it is on spark 1.3 on cluster


hope that helps


On Sat, Aug 1, 2015 at 5:43 PM Sea 261810...@qq.com wrote:

Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the max limit of the server, and the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju  20   0 75.5g  55g  28m S 1729.3 88.1 2172:52 java


55g is more than the 50g I applied for



-- 

-Barak








-- 

-Barak

About memory leak in spark 1.4.1

2015-08-01 Thread Sea
Hi, all
I upgraded spark to 1.4.1, and many applications failed... I find the heap memory is not full, but the CoarseGrainedExecutorBackend process takes more memory than I expect, and it increases as time goes on, finally exceeding the max limit of the server, and the worker dies.


Can anyone help?


Mode: standalone


spark.executor.memory 50g


25583 xiaoju  20   0 75.5g  55g  28m S 1729.3 88.1 2172:52 java


55g is more than the 50g I applied for

Re: Asked to remove non-existent executor exception

2015-07-26 Thread Sea
This exception is so ugly!!! The screen is full of this information when the program runs a long time, and it will not fail the job.

I commented it out in the source code. I think this information is useless, because the executor is already removed and I don't know what the executor id means.

Should we remove this information forever?
 
 
 
 15/07/23 13:26:41 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 2...
 
15/07/23 13:26:41 ERROR SparkDeploySchedulerBackend: Asked to remove 
non-existent executor 2...






  

 

------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Sunday, Jul 26, 2015, 10:51 PM
To: Pa Rö <paul.roewer1...@googlemail.com>
Cc: user <user@spark.apache.org>
Subject: Re: Asked to remove non-existent executor exception

 

 You can list the files in tmpfs in reverse chronological order and remove the 
oldest until you have enough space.  

 Cheers

 
 On Sun, Jul 26, 2015 at 12:43 AM, Pa Rö paul.roewer1...@googlemail.com wrote:
i have seen that the tmpfs is full; how can i clear it?
   
 2015-07-23 13:41 GMT+02:00 Pa Rö paul.roewer1...@googlemail.com:
hello spark community,


i have built an application with GeoMesa, Accumulo and Spark.

It works in Spark local mode, but not on the Spark cluster. In short it says: No space left on device. Asked to remove non-existent executor XY.
I'm confused, because there were many GBs of free space. Do I need to change my configuration, or what else can I do? Thanks in advance.

here is the complete exception:

log4j:WARN No appenders could be found for logger (org.apache.accumulo.fate.zookeeper.ZooSession).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/07/23 13:26:39 INFO SparkContext: Running Spark version 1.3.0
15/07/23 13:26:39 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/07/23 13:26:39 INFO SecurityManager: Changing view acls to: marcel
15/07/23 13:26:39 INFO SecurityManager: Changing modify acls to: marcel
15/07/23 13:26:39 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(marcel); users 
with modify permissions: Set(marcel)
15/07/23 13:26:39 INFO Slf4jLogger: Slf4jLogger started
15/07/23 13:26:40 INFO Remoting: Starting remoting
15/07/23 13:26:40 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkDriver@node1-scads02:52478]
15/07/23 13:26:40 INFO Utils: Successfully started service 'sparkDriver' on 
port 52478.
15/07/23 13:26:40 INFO SparkEnv: Registering MapOutputTracker
15/07/23 13:26:40 INFO SparkEnv: Registering BlockManagerMaster
15/07/23 13:26:40 INFO DiskBlockManager: Created local directory at 
/tmp/spark-ca9319d4-68a2-4add-a21a-48b13ae9cf81/blockmgr-cbf8af23-e113-4732-8c2c-7413ad237b3b
15/07/23 13:26:40 INFO MemoryStore: MemoryStore started with capacity 1916.2 MB
15/07/23 13:26:40 INFO HttpFileServer: HTTP File server directory is 
/tmp/spark-9d4a04d5-3535-49e0-a859-d278a0cc7bf8/httpd-1882aafc-45fe-4490-803d-c04fc67510a2
15/07/23 13:26:40 INFO HttpServer: Starting HTTP Server
15/07/23 13:26:40 INFO Server: jetty-8.y.z-SNAPSHOT
15/07/23 13:26:40 INFO AbstractConnector: Started SocketConnector@0.0.0.0:56499
15/07/23 13:26:40 INFO Utils: Successfully started service 'HTTP file server' 
on port 56499.
15/07/23 13:26:40 INFO SparkEnv: Registering OutputCommitCoordinator
15/07/23 13:26:40 INFO Server: jetty-8.y.z-SNAPSHOT
15/07/23 13:26:40 INFO AbstractConnector: Started 
SelectChannelConnector@0.0.0.0:4040
15/07/23 13:26:40 INFO Utils: Successfully started service 'SparkUI' on port 
4040.
15/07/23 13:26:40 INFO SparkUI: Started SparkUI at http://node1-scads02:4040
15/07/23 13:26:40 INFO AppClient$ClientActor: Connecting to master 
akka.tcp://sparkMaster@node1-scads02:7077/user/Master...
15/07/23 13:26:40 INFO SparkDeploySchedulerBackend: Connected to Spark cluster 
with app ID app-20150723132640-
15/07/23 13:26:40 INFO AppClient$ClientActor: Executor added: 
app-20150723132640-/0 on worker-20150723132524-node3-scads06-7078 
(node3-scads06:7078) with 8 cores
15/07/23 13:26:40 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20150723132640-/0 on hostPort node3-scads06:7078 with 8 cores, 512.0 MB 
RAM
15/07/23 13:26:40 INFO AppClient$ClientActor: Executor added: 
app-20150723132640-/1 on worker-20150723132513-node2-scads05-7078 
(node2-scads05:7078) with 8 cores
15/07/23 13:26:40 INFO SparkDeploySchedulerBackend: Granted executor ID 
app-20150723132640-/1 on hostPort node2-scads05:7078 with 8 cores, 512.0 MB 
RAM
15/07/23 13:26:40 INFO AppClient$ClientActor: Executor updated: 
app-20150723132640-/0 is now RUNNING
15/07/23 13:26:40 INFO AppClient$ClientActor: Executor updated: 

About extra memory on yarn mode

2015-07-14 Thread Sea
Hi all:
I have a question about why Spark on YARN needs extra memory.
I apply for 10 executors with executor memory 6g, and I find that it allocates 1g more per executor, so 7g in total for each executor.
I tried to set spark.yarn.executor.memoryOverhead, but it did not help.
1g per executor is too much; who knows how to adjust its size?
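
For what it's worth, in the 1.x YARN backend the overhead defaults to roughly max(384 MB, ~10% of executor memory), and YARN then rounds every container up to a multiple of yarn.scheduler.minimum-allocation-mb (often 1024 MB), which is how 6g can become 7g even after the overhead is lowered. A hedged sketch of setting it explicitly (class and jar names are hypothetical):

spark-submit --master yarn-client \
  --num-executors 10 \
  --executor-memory 6g \
  --conf spark.yarn.executor.memoryOverhead=512 \
  --class com.example.MyApp my-app.jar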

[SPARK-SQL] libgplcompression.so already loaded in another classloader

2015-07-07 Thread Sea
Hi, all
I found an Exception when using spark-sql 
java.lang.UnsatisfiedLinkError: Native Library 
/data/lib/native/libgplcompression.so already loaded in another classloader ...


I set  spark.sql.hive.metastore.jars=.  in file spark-defaults.conf


It does not happen every time. Who knows why?


Spark version: 1.4.0
Hadoop version: 2.2.0

Re: Uncaught exception in thread delete Spark local dirs

2015-06-27 Thread Sea
SPARK_CLASSPATH is nice; spark.jars needs to list all the jars one by one when submitting to YARN, because spark.driver.classpath and spark.executor.classpath are not available in yarn mode. Can someone remove the warning from the code, or make the jars in spark.driver.classpath and spark.executor.classpath get uploaded?




------ Original message ------
From: Tathagata Das <t...@databricks.com>
Date: Saturday, Jun 27, 2015, 5:36 PM
To: Guillermo Ortiz <konstt2...@gmail.com>
Cc: Hari Shreedharan <hshreedha...@cloudera.com>; user <user@spark.apache.org>
Subject: Re: Uncaught exception in thread delete Spark local dirs



Well, though randomly chosen, SPARK_CLASSPATH is a recognized env variable that is picked up by spark-submit. That is what was used pre-Spark-1.0, but it got deprecated after that. Mind renaming that variable and trying it out again? At least it will remove one possible source of problems.

TD


On Sat, Jun 27, 2015 at 2:32 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
I'm checking the logs in YARN and I found this error as well
Application application_1434976209271_15614 failed 2 times due to AM Container 
for appattempt_1434976209271_15614_02 exited with exitCode: 255




Diagnostics: Exception from container-launch.
Container id: container_1434976209271_15614_02_01
Exit code: 255
Stack trace: ExitCodeException exitCode=255:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
at org.apache.hadoop.util.Shell.run(Shell.java:455)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:715)
at 
org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:293)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Shell output: Requested user hdfs is not whitelisted and has id 496,which is 
below the minimum allowed 1000


Container exited with a non-zero exit code 255
Failing this attempt. Failing the application.




2015-06-27 11:25 GMT+02:00 Guillermo Ortiz konstt2...@gmail.com:
Well, SPARK_CLASSPATH is just a random name; the complete script is this:


export HADOOP_CONF_DIR=/etc/hadoop/conf
SPARK_CLASSPATH="file:/usr/metrics/conf/elasticSearch.properties,file:/usr/metrics/conf/redis.properties,/etc/spark/conf.cloudera.spark_on_yarn/yarn-conf/"
for lib in `ls /usr/metrics/lib/*.jar`
do
  if [ -z "$SPARK_CLASSPATH" ]; then
    SPARK_CLASSPATH=$lib
  else
    SPARK_CLASSPATH="$SPARK_CLASSPATH,$lib"
  fi
done
spark-submit --name "Metrics"


I need to add all the jars, as you know; maybe SPARK_CLASSPATH was a bad name


The code doesn't have any stateful operation, so I guess it's okay that it doesn't have a checkpoint. I have executed this code hundreds of times in a VM from Cloudera and never got this error.


2015-06-27 11:21 GMT+02:00 Tathagata Das t...@databricks.com:
1. you need checkpointing mostly for recovering from driver failures, and in 
some cases also for some stateful operations.

2. Could you try not using the SPARK_CLASSPATH environment variable. 


TD


On Sat, Jun 27, 2015 at 1:00 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
I don't have any checkpoint in my code. Really, I don't have to save any state; it's just log processing for a PoC. I have been testing the code in a VM from Cloudera and I never got that error... Now it's a real cluster.


The command to execute Spark:
spark-submit --name "PoC Logs" --master yarn-client --class com.metrics.MetricsSpark --jars $SPARK_CLASSPATH --executor-memory 1g /usr/metrics/ex/metrics-spark.jar $1 $2 $3


val sparkConf = new SparkConf()
val ssc = new StreamingContext(sparkConf, Seconds(5))
val kafkaParams = Map[String, String]("metadata.broker.list" -> args(0))
val topics = args(1).split("\\,")
val directKafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics.toSet)


directKafkaStream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val documents = rdd.mapPartitionsWithIndex { (i, kafkaEvent) =>
  .
   }


I understand that I just need a checkpoint if I need to recover the task if something goes wrong, right?





2015-06-27 9:39 GMT+02:00 Tathagata Das t...@databricks.com:
How are you trying to execute the code again? From checkpoints, or otherwise? Also cc'ed Hari, who may have a better idea of YARN-related issues.


On Sat, Jun 27, 2015 at 12:35 AM, Guillermo Ortiz konstt2...@gmail.com wrote:
Hi,

I'm executing Spark Streaming code with Kafka. The

Re: Time is ugly in Spark Streaming....

2015-06-27 Thread Sea
Yes, things go well now. It was a problem with SimpleDateFormat. Thank you all.




------ Original message ------
From: Dumas Hwang <dumas.hw...@gmail.com>
Date: Jun 27, 2015, 8:16
To: Tathagata Das <t...@databricks.com>
Cc: Emrehan Tüzün <emrehan.tu...@gmail.com>; Sea <261810...@qq.com>; dev <d...@spark.apache.org>; user <user@spark.apache.org>
Subject: Re: Time is ugly in Spark Streaming



Java's SimpleDateFormat is not thread-safe. You can consider using DateTimeFormatter if you are using Java 8, or Joda-Time.
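
A minimal Scala sketch of the thread-safe alternative (assuming Java 8's java.time is available on the executors):

import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// DateTimeFormatter is immutable and thread-safe, so a single shared
// instance can be used from every task/partition.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss")
  .withZone(ZoneId.systemDefault())

def formatMillis(millis: Long): String = fmt.format(Instant.ofEpochMilli(millis))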

On Sat, Jun 27, 2015 at 3:32 AM, Tathagata Das t...@databricks.com wrote:
Could you print the time on the driver (that is, in foreachRDD but before 
RDD.foreachPartition) and see if it is behaving weird?

TD


On Fri, Jun 26, 2015 at 3:57 PM, Emrehan Tüzün emrehan.tu...@gmail.com wrote:
 





On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810...@qq.com wrote:

 Hi, all
 

 I find a problem in Spark Streaming: when I use the time in the foreachRDD function... I find the time is very interesting.

 val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
 dataStream.map(x => createGroup(x._2, dimensions)).groupByKey().foreachRDD((rdd, time) => {
   try {
     if (!rdd.partitions.isEmpty) {
       rdd.foreachPartition(partition => {
         handlePartition(partition, timeType, time, dimensions, outputTopic, brokerList)
       })
     }
   } catch {
     case e: Exception => e.printStackTrace()
   }
 })
 

 val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
 var date = dateFormat.format(new Date(time.milliseconds))
 
 Then I insert the 'date' into Kafka, but I found ...
  

 
{"timestamp":"2015-06-00T16:50:02","status":3,"type":1,"waittime":0,"count":17}
{"timestamp":"2015-06-26T16:51:13","status":1,"type":1,"waittime":0,"count":34}
{"timestamp":"2015-06-00T16:50:02","status":4,"type":0,"waittime":0,"count":279}
{"timestamp":"2015-06-26T16:52:00","status":11,"type":1,"waittime":0,"count":9}
{"timestamp":"0020-06-26T16:50:36","status":7,"type":0,"waittime":0,"count":1722}
{"timestamp":"2015-06-10T16:51:17","status":0,"type":0,"waittime":0,"count":2958}
{"timestamp":"2015-06-26T16:52:00","status":0,"type":1,"waittime":0,"count":114}
{"timestamp":"2015-06-10T16:51:17","status":11,"type":0,"waittime":0,"count":2066}
{"timestamp":"2015-06-26T16:52:00","status":1,"type":0,"waittime":0,"count":1539}

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Yes, I made it.




------ Original message ------
From: Gerard Maas <gerard.m...@gmail.com>
Date: Jun 26, 2015, 5:40
To: Sea <261810...@qq.com>
Cc: user <user@spark.apache.org>; dev <d...@spark.apache.org>
Subject: Re: Time is ugly in Spark Streaming



Are you sharing the SimpleDateFormat instance? This looks a lot more like the non-thread-safe behaviour of SimpleDateFormat (which has claimed many unsuspecting victims over the years) than anything 'ugly' in Spark Streaming. Try writing the timestamps in millis to Kafka and compare.

-kr, Gerard.


On Fri, Jun 26, 2015 at 11:06 AM, Sea 261810...@qq.com wrote:
Hi, all


I found a problem in Spark Streaming: when I use the time in the foreachRDD function... I find the time is very interesting.
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
dataStream.map(x => createGroup(x._2, dimensions)).groupByKey().foreachRDD((rdd, time) => {
  try {
    if (!rdd.partitions.isEmpty) {
      rdd.foreachPartition(partition => {
        handlePartition(partition, timeType, time, dimensions, outputTopic, brokerList)
      })
    }
  } catch {
    case e: Exception => e.printStackTrace()
  }
})


val dateFormat = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss")
var date = dateFormat.format(new Date(time.milliseconds))

Then I insert the 'date' into Kafka, but I found ...


{"timestamp":"2015-06-00T16:50:02","status":3,"type":1,"waittime":0,"count":17}
{"timestamp":"2015-06-26T16:51:13","status":1,"type":1,"waittime":0,"count":34}
{"timestamp":"2015-06-00T16:50:02","status":4,"type":0,"waittime":0,"count":279}
{"timestamp":"2015-06-26T16:52:00","status":11,"type":1,"waittime":0,"count":9}
{"timestamp":"0020-06-26T16:50:36","status":7,"type":0,"waittime":0,"count":1722}
{"timestamp":"2015-06-10T16:51:17","status":0,"type":0,"waittime":0,"count":2958}
{"timestamp":"2015-06-26T16:52:00","status":0,"type":1,"waittime":0,"count":114}
{"timestamp":"2015-06-10T16:51:17","status":11,"type":0,"waittime":0,"count":2066}
{"timestamp":"2015-06-26T16:52:00","status":1,"type":0,"waittime":0,"count":1539}

How to use a different version of hive

2015-06-21 Thread Sea
Hi, all:
We have our own version of hive 0.13.1; we altered the code for table-operation permissions and for an issue of hive 0.13.1, HIVE-6131.
Spark 1.4.0 supports different versions of the hive metastore; who can give an example?
I am confused by these:

spark.sql.hive.metastore.jars
spark.sql.hive.metastore.sharedPrefixes
spark.sql.hive.metastore.barrierPrefixes


Does it support yarn-client?
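
For what it's worth, a minimal spark-defaults.conf sketch (the paths are hypothetical; see the Spark SQL docs of your version for the exact semantics):

spark.sql.hive.metastore.version 0.13.1
# a classpath with your patched Hive jars plus the matching Hadoop client jars,
# or one of the special values "builtin" / "maven"
spark.sql.hive.metastore.jars /opt/hive-0.13.1/lib/*:/opt/hadoop/client/*

sharedPrefixes and barrierPrefixes only control which classes are shared between Spark's classloader and the metastore classloader, and can usually be left at their defaults.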

Re: About Jobs UI in yarn-client mode

2015-06-21 Thread Sea
Thanks, it is OK now!




------ Original message ------
From: Gavin Yue <yue.yuany...@gmail.com>
Date: Jun 21, 2015, 4:40
To: Sea <261810...@qq.com>
Cc: user <user@spark.apache.org>
Subject: Re: About Jobs UI in yarn-client mode



I got the same problem when I upgraded from 1.3.1 to 1.4.


The same conf has been used; 1.3 works, but the 1.4 UI does not.


So I added the following to yarn-site.xml:

  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value></value>
  </property>

That solved the problem.


Spark 1.4 + Yarn 2.7 + Java 8 


On Fri, Jun 19, 2015 at 8:48 AM, Sea 261810...@qq.com wrote:
Hi, all:
I run Spark on YARN and I want to see the Jobs UI at http://ip:4040/, but it redirects to http://${yarn.ip}/proxy/application_1428110196022_924324/, which cannot be found. Why?
Can anyone help?


About Jobs UI in yarn-client mode

2015-06-19 Thread Sea
Hi, all:
I run Spark on YARN and I want to see the Jobs UI at http://ip:4040/, but it redirects to http://${yarn.ip}/proxy/application_1428110196022_924324/, which cannot be found. Why?
Can anyone help?

Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Hi, all:


I want to run Spark SQL on YARN (yarn-client), but... I already set spark.yarn.jar and spark.jars in conf/spark-defaults.conf.
./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > game.txt
Exception in thread main java.lang.NoClassDefFoundError: 
org/apache/spark/deploy/yarn/ExecutorLauncher
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.ExecutorLauncher
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.yarn.ExecutorLauncher.  
Program will exit.





Can anyone help?

Exception in using updateStateByKey

2015-04-27 Thread Sea
Hi, all:
I use the updateStateByKey function in Spark Streaming. I need to store the states for one minute; I set spark.cleaner.ttl to 120 and the batch duration is 2 seconds, but it throws an Exception:




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc, 2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60, 2).map(lambda x: analyzeMessage(x[1])) \
    .filter(lambda x: x[1] is not None).updateStateByKey(updateStateFunc) \
    .filter(lambda x: x[1]['isExisted'] != 1) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))
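
For comparison, a minimal self-contained Scala sketch of updateStateByKey with an explicit checkpoint directory (the socket source and the HDFS path are hypothetical stand-ins):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("update-state-sketch")
val ssc = new StreamingContext(conf, Seconds(2))
// updateStateByKey requires a checkpoint directory.
ssc.checkpoint("hdfs://mycluster/user/spark/ck")

val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

// Fold the new counts for each key into its running state.
def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
  Some(newValues.sum + state.getOrElse(0))

val counts = words.map(w => (w, 1)).updateStateByKey(updateFunc)
counts.print()

ssc.start()
ssc.awaitTermination()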

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
I set it to 240; it happens again when 240 seconds are reached.




------ Original message ------
From: 261810726 <261810...@qq.com>
Date: Apr 27, 2015, 10:24
To: Ted Yu <yuzhih...@gmail.com>
Subject: Re: Exception in using updateStateByKey



Yes, I can make it larger, but I also want to know whether there is a formula to estimate it.




------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Apr 27, 2015, 10:20
To: Sea <261810...@qq.com>
Subject: Re: Exception in using updateStateByKey



Can you make the value for spark.cleaner.ttl larger? Cheers


On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote:
My hadoop version is 2.2.0, and the hdfs-audit.log is too large. The problem is that when the checkpoint info is deleted (it depends on 'spark.cleaner.ttl'), it will throw this exception.
 



------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Apr 27, 2015, 9:55
To: Sea <261810...@qq.com>
Cc: user <user@spark.apache.org>
Subject: Re: Exception in using updateStateByKey



Which hadoop release are you using?

Can you check hdfs audit log to see who / when deleted 
spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 ?


Cheers


On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:
Hi, all:
I use the updateStateByKey function in Spark Streaming. I need to store the states for one minute; I set spark.cleaner.ttl to 120 and the batch duration is 2 seconds, but it throws an Exception:




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc, 2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60, 2).map(lambda x: analyzeMessage(x[1])) \
    .filter(lambda x: x[1] is not None).updateStateByKey(updateStateFunc) \
    .filter(lambda x: x[1]['isExisted'] != 1) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))

Re: Exception in using updateStateByKey

2015-04-27 Thread Sea
Maybe I found the solution: do not set 'spark.cleaner.ttl'; just use the 'remember' function in StreamingContext to set the rememberDuration.
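
A tiny sketch of that approach in the Scala API (durations illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("remember-sketch")
val ssc = new StreamingContext(conf, Seconds(2))
// Keep generated RDDs around at least as long as the largest window used,
// instead of relying on the global spark.cleaner.ttl sweep.
ssc.remember(Seconds(120))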




------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Apr 27, 2015, 10:20
To: Sea <261810...@qq.com>
Subject: Re: Exception in using updateStateByKey



Can you make the value for spark.cleaner.ttl larger? Cheers


On Mon, Apr 27, 2015 at 7:13 AM, Sea 261810...@qq.com wrote:
My hadoop version is 2.2.0, and the hdfs-audit.log is too large. The problem is that when the checkpoint info is deleted (it depends on 'spark.cleaner.ttl'), it will throw this exception.
 



------ Original message ------
From: Ted Yu <yuzhih...@gmail.com>
Date: Apr 27, 2015, 9:55
To: Sea <261810...@qq.com>
Cc: user <user@spark.apache.org>
Subject: Re: Exception in using updateStateByKey



Which hadoop release are you using?

Can you check hdfs audit log to see who / when deleted 
spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443 ?


Cheers


On Mon, Apr 27, 2015 at 6:21 AM, Sea 261810...@qq.com wrote:
Hi, all:
I use the updateStateByKey function in Spark Streaming. I need to store the states for one minute; I set spark.cleaner.ttl to 120 and the batch duration is 2 seconds, but it throws an Exception:




Caused by: 
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does 
not exist: spark/ck/hdfsaudit/receivedData/0/log-1430139541443-1430139601443
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:51)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsUpdateTimes(FSNamesystem.java:1499)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1448)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1428)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1402)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:468)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:269)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59566)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)


at org.apache.hadoop.ipc.Client.call(Client.java:1347)
at org.apache.hadoop.ipc.Client.call(Client.java:1300)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
at com.sun.proxy.$Proxy14.getBlockLocations(Unknown Source)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:188)
at sun.reflect.GeneratedMethodAccessor21.invoke(Unknown Source)



Why?


my code is 


ssc = StreamingContext(sc, 2)
kvs = KafkaUtils.createStream(ssc, zkQuorum, group, {topic: 1})
kvs.window(60, 2).map(lambda x: analyzeMessage(x[1])) \
    .filter(lambda x: x[1] is not None).updateStateByKey(updateStateFunc) \
    .filter(lambda x: x[1]['isExisted'] != 1) \
    .foreachRDD(lambda rdd: rdd.foreachPartition(insertIntoDb))

Filesystem closed Exception

2015-03-20 Thread Sea
Hi, all:




When I exit the console of spark-sql, the following exception is thrown:


My spark version is 1.3.0; hadoop version is 2.2.0.


Exception in thread "Thread-3" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1677)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:196)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)‍

Filesystem closed Exception

2015-03-20 Thread Sea
Hi, all:




When I exit the console of spark-sql, the following exception is thrown:


Exception in thread "Thread-3" java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:629)
at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:1677)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1106)
at 
org.apache.hadoop.hdfs.DistributedFileSystem$17.doCall(DistributedFileSystem.java:1102)
at 
org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1102)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1397)
at 
org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:196)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at 
org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.stop(SparkSQLEnv.scala:66)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$$anon$1.run(SparkSQLCLIDriver.scala:107)‍

InvalidAuxServiceException in dynamicAllocation

2015-03-17 Thread Sea
Hi, all:


Spark 1.3.0, hadoop 2.2.0


I put the following params in the spark-defaults.conf 


spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 20
spark.dynamicAllocation.maxExecutors 300
spark.dynamicAllocation.executorIdleTimeout 300
spark.shuffle.service.enabled true



I started the thriftserver and did a query. An exception happened!
I found it in JIRA: https://issues.apache.org/jira/browse/SPARK-5759
It says fixed version 1.3.0.


Caused by: org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:spark_shuffle does not exist
    at sun.reflect.GeneratedConstructorAccessor28.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:152)
    at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
    at org.apache.hadoop.yarn.client.api.impl.NMClientImpl.startContainer(NMClientImpl.java:203)
    at org.apache.spark.deploy.yarn.ExecutorRunnable.startContainer(ExecutorRunnable.scala:113)
    ... 4 more
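
That exception usually means the NodeManagers were never told about Spark's external shuffle service. A sketch of the yarn-site.xml registration described in the Spark docs (the shuffle jar, e.g. spark-<version>-yarn-shuffle.jar, must also be on the NodeManager classpath, and the NodeManagers restarted):

  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>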