Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
Running 'lsof' will show us the open files, but how do we find out the
root cause behind opening too many files?

Thanks,
Padma CH

On Wed, Jan 6, 2016 at 8:39 AM, Hamel Kothari 
wrote:

> The "Too Many Files" part of the exception is just indicative of the fact
> that when that call was made, too many files were already open. It doesn't
> necessarily mean that that line is the source of all of the open files,
> that's just the point at which it hit its limit.
>
> What I would recommend is to try to run this code again and use "lsof" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
> On Tue, Jan 5, 2016 at 9:48 PM Priya Ch 
> wrote:
>
>> Yes, the FileInputStream is closed. Maybe I didn't show it in the screen
>> shot.
>>
>> As Spark implements sort-based shuffle, there is a parameter called
>> maximum merge factor which decides the number of files that can be merged
>> at once, and this avoids too many open files. I suspect that it is
>> something related to this.
>>
>> Can someone confirm this?
>>
>> On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <
>> melongo_anna...@yahoo.com> wrote:
>>
>>> Vijay,
>>>
>>> Are you closing the fileinputstream at the end of each loop (
>>> in.close())? My guess is those streams aren't closed, and thus the "too many
>>> open files" exception.
>>>
>>>
>>> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
>>> learnings.chitt...@gmail.com> wrote:
>>>
>>>
>>> Can some one throw light on this ?
>>>
>>> Regards,
>>> Padma Ch
>>>
>>> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
>>> wrote:
>>>
>>> Chris, we are using Spark version 1.3.0. We have not set the
>>> spark.streaming.concurrentJobs parameter; it takes the default value.
>>>
>>> Vijay,
>>>
>>>   From the stack trace it is evident that
>>> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
>>> is throwing the exception. I opened the Spark source code and visited the
>>> line which is throwing this exception, i.e.
>>>
>>> [image: Inline image 1]
>>>
>>> The line which is marked in red is throwing the exception. The file is
>>> ExternalSorter.scala in the org.apache.spark.util.collection package.
>>>
>>> I went through the following blog
>>> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
>>> and understood that there is a merge factor which decides the number of
>>> on-disk files that can be merged. Is it in some way related to this?
>>>
>>> Regards,
>>> Padma CH
>>>
>>> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>>>
>>> and which version of Spark/Spark Streaming are you using?
>>>
>>> are you explicitly setting the spark.streaming.concurrentJobs to
>>> something larger than the default of 1?
>>>
>>> if so, please try setting that back to 1 and see if the problem still
>>> exists.
>>>
>>> this is a dangerous parameter to modify from the default - which is why
>>> it's not well-documented.
>>>
>>>
>>> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
>>> wrote:
>>>
>>> Few indicators -
>>>
>>> 1) during execution time - check the total number of open files using the lsof
>>> command (needs root permissions). If it is a cluster, I am not sure how much this helps!
>>> 2) which exact line in the code is triggering this error ? Can you paste
>>> that snippet ?
>>>
>>>
>>> On Wednesday 23 December 2015, Priya Ch 
>>> wrote:
>>>
>>> ulimit -n 65000
>>>
>>> fs.file-max = 65000 ( in etc/sysctl.conf file)
>>>
>>> Thanks,
>>> Padma Ch
>>>
>>> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>>>
>>> Could you share the ulimit for your setup please ?
>>> - Thanks, via mobile,  excuse brevity.
>>> On Dec 22, 2015 6:39 PM, "Priya Ch" 
>>> wrote:
>>>
>>> Jakob,
>>>
>>>Increased the settings like fs.file-max in /etc/sysctl.conf and also
>>> increased user limit in /etc/security/limits.conf. But still see the
>>> same issue.
>>>
>>> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
>>> wrote:
>>>
>>> It might be a good idea to see how many files are open and try
>>> increasing the open file limit (this is done on an os level). In some
>>> application use-cases it is actually a legitimate need.
>>>
>>> If that doesn't help, make sure you close any unused files and streams
>>> in your code. It will also be easier to help diagnose the issue if you send
>>> an error-reproducing snippet.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Vijay Gharge
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>>
>>> *Chris Fregly*
>>> Principal Data Solutions Engineer
>>> IBM Spark Technology Center, San Francisco, CA
>>> http://spark.tc | 

RE: Out of memory issue

2016-01-06 Thread Ewan Leith
Hi Muthu, this could be related to a known issue in the release notes

http://spark.apache.org/releases/spark-release-1-6-0.html

Known issues

SPARK-12546 -  Save DataFrame/table as Parquet with dynamic partitions may 
cause OOM; this can be worked around by decreasing the memory used by both 
Spark and Parquet using spark.memory.fraction (for example, 0.4) and 
parquet.memory.pool.ratio (for example, 0.3, in Hadoop configuration, e.g. 
setting it in core-site.xml).

It's definitely worth setting spark.memory.fraction and 
parquet.memory.pool.ratio and trying again.
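For reference, a minimal sketch of where the two settings go (the values are
the ones from the release note, as starting points rather than tuned numbers):

import org.apache.spark.{SparkConf, SparkContext}

// Lower Spark's unified memory fraction (a Spark setting)
val conf = new SparkConf().set("spark.memory.fraction", "0.4")
val sc = new SparkContext(conf)

// parquet.memory.pool.ratio is a Hadoop/Parquet setting; it can live in
// core-site.xml or be set on the job's Hadoop configuration like this:
sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")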

Ewan

-Original Message-
From: babloo80 [mailto:bablo...@gmail.com] 
Sent: 06 January 2016 03:44
To: user@spark.apache.org
Subject: Out of memory issue

Hello there,

I have a spark job that reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in
different stages of execution and creates a result parquet file of 9 GB (about 27
million rows containing 165 columns; some columns are map based, containing
at most 200-value histograms). The stages involved:
Step 1: Read the data using the dataframe api.
Step 2: Transform the dataframe to an RDD (as some of the columns are transformed
into histograms (using an empirical distribution to cap the number of keys) and
some of them run like a UDAF during the reduce-by-key step) and perform some
transformations.
Step 3: Reduce the result by key so that the resultant can be used in the next
stage for a join.
Step 4: Perform a left outer join of this result, which runs similar Steps 1 thru 3.
Step 5: The results are further reduced to be written to parquet.

With Apache Spark 1.5.2, I am able to run the job with no issues.
Current env uses 8 nodes running a total of  320 cores, 100 GB executor memory 
per node with driver program using 32 GB. The approximate execution time is 
about 1.2 hrs. The parquet files are stored in another HDFS cluster for read 
and eventual write of the result.

When the same job is executed using Apache Spark 1.6.0, some of the executor nodes'
JVMs get restarted (with a new executor id). On further turning on GC stats on
the executor, the perm-gen seems to get maxed out and ends up showing the
symptom of out-of-memory.

Please advise on where to start investigating this issue.

Thanks,
Muthu



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-issue-tp25888.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark DataFrame limit question

2016-01-06 Thread Arkadiusz Bicz
Hi,

Does limit work for DataFrames, Spark SQL and the Hive Context without a
full scan for parquet in Spark 1.6?

I just used it to create a small parquet file from a large number of
parquet files and found out that it does a full scan of all the data
instead of just reading a limited number of rows.

All of the below commands do a full scan:

val results = sqlContext.read.load("/largenumberofparquetfiles/")

results.limit(1).write.parquet("/tmp/smallresults1")

results.registerTempTable("resultTemp")

val select = sqlContext.sql("select * from resultTemp limit 1")

select.write.parquet("/tmp/smallresults2")

The same happens when I create an external table in the hive context for the results table:

hiveContext.sql("select * from results limit
1").write.parquet("/tmp/results/one3")


Thanks,

Arkadiusz Bicz

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: How to accelerate reading json file?

2016-01-06 Thread Ewan Leith
If you already know the schema, then you can run the read with the schema 
parameter like this:


val path = "examples/src/main/resources/jsonfile"

val jsonSchema =  StructType(
StructField("id",StringType,true) ::
StructField("reference",LongType,true) ::
StructField("details",detailsSchema, true) ::
StructField("value",StringType,true) ::Nil)

val people = sqlContext.read.schema(jsonSchema).json(path)
If you have the schema defined as a separate small JSON file, then you can load
it directly by running something like:

val jsonSchema = sqlContext.read.json("path/to/schema").schema
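
If you ever do need to infer, one more knob that may help (assuming the
built-in JSON source; I haven't measured it on data this size) is sampling
only a fraction of the records for schema inference:

val people = sqlContext.read.option("samplingRatio", "0.1").json(path)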

Thanks,
Ewan

From: Gavin Yue [mailto:yue.yuany...@gmail.com]
Sent: 06 January 2016 07:14
To: user 
Subject: How to accelerate reading json file?

I am trying to read json files following the example:

val path = "examples/src/main/resources/jsonfile"

val people = sqlContext.read.json(path)

I have 1 TB of files in the path. It took 1.2 hours to finish reading and
infer the schema.

But I already know the schema. Could I make this process short?

Thanks a lot.





How to insert df in HBASE

2016-01-06 Thread Sadaf
HI,

I need to insert a DataFrame into HBase using Scala code.
Can anyone guide me on how to achieve this?

Any help would be much appreciated. 
Thanks




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-df-in-HBASE-tp25891.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to concat few rows into a new column in dataframe

2016-01-06 Thread Sabarish Sasidharan
You can just repartition by the id, if the final objective is to have all
data for the same key in the same partition.
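
A minimal sketch of that (the column name and output path are assumptions):

import org.apache.spark.sql.functions.col

// Spark 1.6: repartition by a column expression so all rows with the same id
// end up in the same partition before writing.
df.repartition(col("id")).write.parquet("/tmp/partitioned-by-id")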

Regards
Sab

On Wed, Jan 6, 2016 at 11:02 AM, Gavin Yue  wrote:

> I found that in 1.6 dataframe could do repartition.
>
> Should I still need to do orderby first or I just have to repartition?
>
>
>
>
> On Tue, Jan 5, 2016 at 9:25 PM, Gavin Yue  wrote:
>
>> I tried Ted's solution and it works. But I keep hitting the JVM
>> out-of-memory problem.
>> And grouping by the key causes a lot of data shuffling.
>>
>> So I am trying to order the data based on ID first and save it as Parquet.
>> Is there a way to make sure that the data is partitioned such that each ID's data
>> is in one partition, so there would be no shuffling in the future?
>>
>> Thanks.
>>
>>
>> On Tue, Jan 5, 2016 at 3:19 PM, Michael Armbrust 
>> wrote:
>>
>>> This would also be possible with an Aggregator in Spark 1.6:
>>>
>>> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Dataset%20Aggregator.html
>>>
>>> On Tue, Jan 5, 2016 at 2:59 PM, Ted Yu  wrote:
>>>
 Something like the following:

 val zeroValue = collection.mutable.Set[String]()

 val aggredated = data.aggregateByKey (zeroValue)((set, v) => set += v,
 (setOne, setTwo) => setOne ++= setTwo)

 On Tue, Jan 5, 2016 at 2:46 PM, Gavin Yue 
 wrote:

> Hey,
>
> For example, a table df with two columns
> id  name
> 1   abc
> 1   bdf
> 2   ab
> 2   cd
>
> I want to group by the id and concat the string into array of string.
> like this
>
> id
> 1 [abc,bdf]
> 2 [ab, cd]
>
> How could I achieve this in dataframe?  I stuck on df.groupBy("id").
> ???
>
> Thanks
>
>

>>>
>>
>


-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++


Re: [Spark on YARN] Multiple Auxiliary Shuffle Service Versions

2016-01-06 Thread Deenar Toraskar
Hi guys


   1. >> Add this jar to the classpath of all NodeManagers in your cluster.


A related question on configuration of the auxiliary shuffle service. *How
do I find the classpath for the NodeManager?* I tried finding all the places where
the existing mapreduce shuffle jars are present and placed the spark yarn
shuffle jar in the same location, but with no success.

$ find . -name *shuffle*.jar
./hadoop/client/hadoop-mapreduce-client-shuffle.jar
./hadoop/client/hadoop-mapreduce-client-shuffle-2.7.1.2.3.2.0-2950.jar
./hadoop/client/spark-1.6.0-SNAPSHOT-yarn-shuffle.jar
./hadoop-mapreduce/hadoop-mapreduce-client-shuffle.jar
./hadoop-mapreduce/hadoop-mapreduce-client-shuffle-2.7.1.2.3.2.0-2950.jar
./falcon/client/lib/hadoop-mapreduce-client-shuffle-2.7.1.2.3.2.0-2950.jar
./oozie/libserver/hadoop-mapreduce-client-shuffle-2.7.1.2.3.2.0-2950.jar
./oozie/libtools/hadoop-mapreduce-client-shuffle-2.7.1.2.3.2.0-2950.jar
./spark/lib/spark-1.4.1.2.3.2.0-2950-yarn-shuffle.jar
Regards
Deenar

On 7 October 2015 at 01:27, Alex Rovner  wrote:

> Thank you all for your help.
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
> * *
>
> On Tue, Oct 6, 2015 at 11:17 AM, Steve Loughran 
> wrote:
>
>>
>> On 6 Oct 2015, at 01:23, Andrew Or  wrote:
>>
>> Both the history server and the shuffle service are backward compatible,
>> but not forward compatible. This means as long as you have the latest
>> version of history server / shuffle service running in your cluster then
>> you're fine (you don't need multiple of them).
>>
>>
>> FWIW I've just created a JIRA on tracking/reporting version mismatch on
>> history server playback better:
>> https://issues.apache.org/jira/browse/SPARK-10950
>>
>> Even though the UI can't be expected to playback later histories, it
>> could be possible to report the issue in a way that users can act on "run a
>> later version", rather than raise support calls.
>>
>>
>


spark 1.6 Issue

2016-01-06 Thread kali.tumm...@gmail.com
Hi All, 

I am running my app in IntelliJ IDEA (locally) with master local[*]. The code
worked OK with Spark 1.5, but when I upgraded to 1.6 I started getting the issue below.

Is this a bug in 1.6? When I change back to 1.5 it works OK without any error.
Do I need to pass executor memory while running locally in Spark 1.6?

Exception in thread "main" java.lang.IllegalArgumentException: System memory
259522560 must be at least 4.718592E8. Please use a larger heap size.

Thanks
Sri 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-Issue-tp25893.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Kostiantyn Kudriavtsev
Hi guys,

the one big issue with this approach:
> spark.hadoop.s3a.access.key is now visible everywhere, in logs, in the Spark
> web UI, and is not secured at all...

On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev 
 wrote:

> thanks Jerry, it works!
> really appreciate your help 
> 
> Thank you,
> Konstantin Kudryavtsev
> 
> On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam  wrote:
> Hi Kostiantyn,
> 
> You should be able to use spark.conf to specify s3a keys.
> 
> I don't remember exactly but you can add hadoop properties by prefixing 
> spark.hadoop.*
> * is the s3a properties. For instance,
> 
> spark.hadoop.s3a.access.key wudjgdueyhsj
> 
> Of course, you need to make sure the property key is right. I'm using my 
> phone so I cannot easily verifying.
> 
> Then you can specify different user using different spark.conf via 
> --properties-file when spark-submit
> 
> HTH,
> 
> Jerry
> 
> Sent from my iPhone
> 
> On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
>  wrote:
> 
>> Hi Jerry,
>> 
>> what you suggested looks to be working (I put hdfs-site.xml into 
>> $SPARK_HOME/conf folder), but could you shed some light on how it can be 
>> federated per user?
>> Thanks in advance!
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
>> Hi Kostiantyn,
>> 
>> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
>> could define different spark-{user-x}.conf and source them during 
>> spark-submit. let us know if hdfs-site.xml works first. It should.
>> 
>> Best Regards,
>> 
>> Jerry
>> 
>> Sent from my iPhone
>> 
>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>>> Hi Jerry,
>>> 
>>> I want to run different jobs on different S3 buckets - different AWS creds 
>>> - on the same instances. Could you shed some light if it's possible to 
>>> achieve with hdfs-site?
>>> 
>>> Thank you,
>>> Konstantin Kudryavtsev
>>> 
>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> Can you define those properties in hdfs-site.xml and make sure it is 
>>> visible in the class path when you spark-submit? It looks like a conf 
>>> sourcing issue to me. 
>>> 
>>> Cheers,
>>> 
>>> Sent from my iPhone
>>> 
>>> On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
>>>  wrote:
>>> 
 Chris,
 
 thanks for the hint with IAM roles, but in my case I need to run
 different jobs with different S3 permissions on the same cluster, so this
 approach doesn't work for me as far as I understood it
 
 Thank you,
 Konstantin Kudryavtsev
 
 On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  wrote:
 couple things:
 
 1) switch to IAM roles if at all possible - explicitly passing AWS 
 credentials is a long and lonely road in the end
 
 2) one really bad workaround/hack is to run a job that hits every worker 
 and writes the credentials to the proper location (~/.awscredentials or 
 whatever)
 
 ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
 autoscaling, but i'm mentioning it anyway as it is a temporary fix.
 
 if you switch to IAM roles, things become a lot easier as you can 
 authorize all of the EC2 instances in the cluster - and handles 
 autoscaling very well - and at some point, you will want to autoscale.
 
 On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
  wrote:
 Chris,
 
  good question, as you can see from the code I set them up on the driver, so I
 expect they will be propagated to all nodes, won't they?
 
 Thank you,
 Konstantin Kudryavtsev
 
 On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  wrote:
 are the credentials visible from each Worker node to all the Executor JVMs 
 on each Worker?
 
 On Dec 30, 2015, at 12:45 PM, KOSTIANTYN Kudriavtsev 
  wrote:
 
> Dear Spark community,
> 
> I faced the following issue with trying accessing data on S3a, my code is 
> the following:
> 
> val sparkConf = new SparkConf()
> 
> val sc = new SparkContext(sparkConf)
> sc.hadoopConfiguration.set("fs.s3a.impl", 
> "org.apache.hadoop.fs.s3a.S3AFileSystem")
> sc.hadoopConfiguration.set("fs.s3a.access.key", "---")
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "---")
> val sqlContext = SQLContext.getOrCreate(sc)
> val df = sqlContext.read.parquet(...)
> df.count
> 
> It results in the following exception and log messages:
> 15/12/30 17:00:32 DEBUG AWSCredentialsProviderChain: Unable to load 
> credentials from BasicAWSCredentialsProvider: Access key or secret key is 

Re: SparkSQL integration issue with AWS S3a

2016-01-06 Thread Jerry Lam
Hi Kostiantyn,

Yes. If security is a concern then this approach cannot satisfy it. The keys
are visible in the properties files. If the goal is to hide them, you might be
able to go a bit further with this approach. Have you looked at the Spark security page?

Best Regards,

Jerry 

Sent from my iPhone

> On 6 Jan, 2016, at 8:49 am, Kostiantyn Kudriavtsev 
>  wrote:
> 
> Hi guys,
> 
> the only one big issue with this approach:
>>> spark.hadoop.s3a.access.key  is now visible everywhere, in logs, in spark 
>>> webui and is not secured at all...
> 
>> On Jan 2, 2016, at 11:13 AM, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>> thanks Jerry, it works!
>> really appreciate your help 
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Fri, Jan 1, 2016 at 4:35 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> You should be able to use spark.conf to specify s3a keys.
>>> 
>>> I don't remember exactly but you can add hadoop properties by prefixing 
>>> spark.hadoop.*
>>> * is the s3a properties. For instance,
>>> 
>>> spark.hadoop.s3a.access.key wudjgdueyhsj
>>> 
>>> Of course, you need to make sure the property key is right. I'm using my 
>>> phone so I cannot easily verifying.
>>> 
>>> Then you can specify different user using different spark.conf via 
>>> --properties-file when spark-submit
>>> 
>>> HTH,
>>> 
>>> Jerry
>>> 
>>> Sent from my iPhone
>>> 
 On 31 Dec, 2015, at 2:06 pm, KOSTIANTYN Kudriavtsev 
  wrote:
 
 Hi Jerry,
 
 what you suggested looks to be working (I put hdfs-site.xml into 
 $SPARK_HOME/conf folder), but could you shed some light on how it can be 
 federated per user?
 Thanks in advance!
 
 Thank you,
 Konstantin Kudryavtsev
 
> On Wed, Dec 30, 2015 at 2:37 PM, Jerry Lam  wrote:
> Hi Kostiantyn,
> 
> I want to confirm that it works first by using hdfs-site.xml. If yes, you 
> could define different spark-{user-x}.conf and source them during 
> spark-submit. let us know if hdfs-site.xml works first. It should.
> 
> Best Regards,
> 
> Jerry
> 
> Sent from my iPhone
> 
>> On 30 Dec, 2015, at 2:31 pm, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> 
>> Hi Jerry,
>> 
>> I want to run different jobs on different S3 buckets - different AWS 
>> creds - on the same instances. Could you shed some light if it's 
>> possible to achieve with hdfs-site?
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Wed, Dec 30, 2015 at 2:10 PM, Jerry Lam  wrote:
>>> Hi Kostiantyn,
>>> 
>>> Can you define those properties in hdfs-site.xml and make sure it is 
>>> visible in the class path when you spark-submit? It looks like a conf 
>>> sourcing issue to me. 
>>> 
>>> Cheers,
>>> 
>>> Sent from my iPhone
>>> 
 On 30 Dec, 2015, at 1:59 pm, KOSTIANTYN Kudriavtsev 
  wrote:
 
 Chris,
 
 thanks for the hist with AIM roles, but in my case  I need to run 
 different jobs with different S3 permissions on the same cluster, so 
 this approach doesn't work for me as far as I understood it
 
 Thank you,
 Konstantin Kudryavtsev
 
> On Wed, Dec 30, 2015 at 1:48 PM, Chris Fregly  
> wrote:
> couple things:
> 
> 1) switch to IAM roles if at all possible - explicitly passing AWS 
> credentials is a long and lonely road in the end
> 
> 2) one really bad workaround/hack is to run a job that hits every 
> worker and writes the credentials to the proper location 
> (~/.awscredentials or whatever)
> 
> ^^ i wouldn't recommend this. ^^  it's horrible and doesn't handle 
> autoscaling, but i'm mentioning it anyway as it is a temporary fix.
> 
> if you switch to IAM roles, things become a lot easier as you can 
> authorize all of the EC2 instances in the cluster - and handles 
> autoscaling very well - and at some point, you will want to autoscale.
> 
>> On Wed, Dec 30, 2015 at 1:08 PM, KOSTIANTYN Kudriavtsev 
>>  wrote:
>> Chris,
>> 
>>  good question, as you can see from the code I set up them on 
>> driver, so I expect they will be propagated to all nodes, won't them?
>> 
>> Thank you,
>> Konstantin Kudryavtsev
>> 
>>> On Wed, Dec 30, 2015 at 1:06 PM, Chris Fregly  
>>> wrote:
>>> are the credentials visible from each Worker node to all the 
>>> Executor JVMs on each Worker?
>>> 
 On Dec 30, 

Re: sparkR ORC support.

2016-01-06 Thread Sandeep Khurana
Felix

I tried the option suggested by you. It gave the error below. I am going to
try the option suggested by Prem.

Error in writeJobj(con, object) : invalid jobj 1
8
stop("invalid jobj ", value$id)
7
writeJobj(con, object)
6
writeObject(con, a)
5
writeArgs(rc, args)
4
invokeJava(isStatic = TRUE, className, methodName, ...)
3
callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
source, options)
2
read.df(sqlContext, filepath, "orc") at
spark_api.R#108

On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
wrote:

> Firstly I don't have ORC data to verify but this should work:
>
> df <- loadDF(sqlContext, "data/path", "orc")
>
> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
> should be called after sparkR.init() - please check if there is any error
> message there.
>
> _
> From: Prem Sure 
> Sent: Tuesday, January 5, 2016 8:12 AM
> Subject: Re: sparkR ORC support.
> To: Sandeep Khurana 
> Cc: spark users , Deepak Sharma <
> deepakmc...@gmail.com>
>
>
>
> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>
>
> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
> wrote:
>
>> Also, do I need to setup hive in spark as per the link
>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>> ?
>>
>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>
>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>> wrote:
>>
>>> Deepak
>>>
>>> Tried this. Getting this error now
>>>
>>> Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
>>> unused argument ("")
>>>
>>>
>>> On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
>>> wrote:
>>>
 Hi Sandeep
 can you try this ?

 results <- sql(hivecontext, "FROM test SELECT id","")

 Thanks
 Deepak


 On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
 wrote:

> Thanks Deepak.
>
> I tried this as well. I created a hivecontext   with  "hivecontext <<-
> sparkRHive.init(sc) "  .
>
> When I tried to read hive table from this ,
>
> results <- sql(hivecontext, "FROM test SELECT id")
>
> I get below error,
>
> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
> SparkR was restarted, Spark operations need to be re-executed.
>
>
> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
> wrote:
>
>> Hi Sandeep
>> I am not sure if ORC can be read directly in R.
>> But there can be a workaround: first create a hive table on top of the ORC
>> files and then access the hive table in R.
>>
>> Thanks
>> Deepak
>>
>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana > > wrote:
>>
>>> Hello
>>>
>>> I need to read an ORC files in hdfs in R using spark. I am not able
>>> to find a package to do that.
>>>
>>> Can anyone help with documentation or example for this purpose?
>>>
>>> --
>>> Architect
>>> Infoworks.io 
>>> http://Infoworks.io
>>>
>>
>>
>>
>> --
>> Thanks
>> Deepak
>> www.bigdatabig.com
>> www.keosha.net
>>
>
>
>
> --
> Architect
> Infoworks.io 
> http://Infoworks.io
>



 --
 Thanks
 Deepak
 www.bigdatabig.com
 www.keosha.net

>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io 
>>> http://Infoworks.io
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io 
>> http://Infoworks.io
>>
>
>
>
>


-- 
Architect
Infoworks.io
http://Infoworks.io


Re: sparkR ORC support.

2016-01-06 Thread Yanbo Liang
You should ensure your sqlContext is HiveContext.

sc <- sparkR.init()

sqlContext <- sparkRHive.init(sc)


2016-01-06 20:35 GMT+08:00 Sandeep Khurana :

> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
> wrote:
>
>> Firstly I don't have ORC data to verify but this should work:
>>
>> df <- loadDF(sqlContext, "data/path", "orc")
>>
>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>> should be called after sparkR.init() - please check if there is any error
>> message there.
>>
>> _
>> From: Prem Sure 
>> Sent: Tuesday, January 5, 2016 8:12 AM
>> Subject: Re: sparkR ORC support.
>> To: Sandeep Khurana 
>> Cc: spark users , Deepak Sharma <
>> deepakmc...@gmail.com>
>>
>>
>>
>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>
>>
>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
>> wrote:
>>
>>> Also, do I need to setup hive in spark as per the link
>>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>>> ?
>>>
>>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>>
>>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>>> wrote:
>>>
 Deepak

 Tried this. Getting this error now

 rror in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :   
 unused argument ("")


 On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
 wrote:

> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
>> Thanks Deepak.
>>
>> I tried this as well. I created a hivecontext   with  "hivecontext
>> <<- sparkRHive.init(sc) "  .
>>
>> When I tried to read hive table from this ,
>>
>> results <- sql(hivecontext, "FROM test SELECT id")
>>
>> I get below error,
>>
>> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
>> SparkR was restarted, Spark operations need to be re-executed.
>>
>>
>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> I am not sure if ORC can be read directly in R.
>>> But there can be a workaround .First create hive table on top of ORC
>>> files and then access hive table in R.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana <
>>> sand...@infoworks.io> wrote:
>>>
 Hello

 I need to read an ORC files in hdfs in R using spark. I am not able
 to find a package to do that.

 Can anyone help with documentation or example for this purpose?

 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io 
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io 
>>> http://Infoworks.io
>>>
>>
>>
>>
>>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Context: Process data coming from Kafka and send back results to Kafka.

Issue: Each event could take several seconds to process (work in progress
to improve that). During that time, events (and RDDs) accumulate.
Intermediate events (by key) do not have to be processed, only the last
ones. So when one process finishes, it would be ideal if Spark Streaming
skipped all events that are not the current last ones (by key).

I'm not sure that this can be done using only the Spark Streaming
API. As I understand Spark Streaming, DStream RDDs will accumulate and be
processed one by one, without considering whether there are others afterwards.

Possible solutions:

Using only the Spark Streaming API, but I'm not sure how. updateStateByKey seems
to be a solution. But I'm not sure that it will work properly when DStream
RDDs accumulate and you only have to process the last events by key.

Have two Spark Streaming pipelines. One to get last updated event by key,
store that in a map or a database. The second pipeline processes events
only if they are the last ones as indicate by the other pipeline.


Sub questions for the second solution:

Could two pipelines share the same sparkStreamingContext and process the
same DStream at different speed (low processing vs high)?

Is it easily possible to share values (map for example) between pipelines
without using an external database? I think accumulator/broadcast could
work but between two pipelines I'm not sure.

Regards,

Julien Naour


Re: How to accelerate reading json file?

2016-01-06 Thread Vijay Gharge
Hi all

I want to ask: how exactly does reading a >1 TB file differ on a standalone
cluster vs. a YARN or Mesos cluster?

On Wednesday 6 January 2016, Gavin Yue  wrote:

> I am trying to read json files following the example:
>
> val path = "examples/src/main/resources/jsonfile"val people = 
> sqlContext.read.json(path)
>
> I have 1 Tb size files in the path.  It took 1.2 hours to finish the reading 
> to infer the schema.
>
> But I already know the schema. Could I make this process short?
>
> Thanks a lot.
>
>
>
>

-- 
Regards,
Vijay Gharge


Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
Have you read

http://kafka.apache.org/documentation.html#compaction



On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour  wrote:

> Context: Process data coming from Kafka and send back results to Kafka.
>
> Issue: Each events could take several seconds to process (Work in progress
> to improve that). During that time, events (and RDD) do accumulate.
> Intermediate events (by key) do not have to be processed, only the last
> ones. So when one process finished it would be ideal that Spark Streaming
> skip all events that are not the current last ones (by key).
>
> I'm not sure that the solution could be done using only Spark Streaming
> API. As I understand Spark Streaming, DStream RDD will accumulate and be
> processed one by one and do not considerate if there are others afterwards.
>
> Possible solutions:
>
> Using only Spark Streaming API but I'm not sure how. updateStateByKey
> seems to be a solution. But I'm not sure that it will work properly when
> DStream RDD accumulate and you have to only process lasts events by key.
>
> Have two Spark Streaming pipelines. One to get last updated event by key,
> store that in a map or a database. The second pipeline processes events
> only if they are the last ones as indicate by the other pipeline.
>
>
> Sub questions for the second solution:
>
> Could two pipelines share the same sparkStreamingContext and process the
> same DStream at different speed (low processing vs high)?
>
> Is it easily possible to share values (map for example) between pipelines
> without using an external database? I think accumulator/broadcast could
> work but between two pipelines I'm not sure.
>
> Regards,
>
> Julien Naour
>


Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks for your answer,

As I understand it, a consumer that stays caught up will read every message
even with compaction. So for a pure Kafka Spark Streaming setup it will not be a
solution.

Perhaps I could reconnect to the Kafka topic after each process to get the
last state of the events and then compare to the current Kafka Spark Streaming
events, but it seems a little tricky. For each event it would connect to
Kafka and get the current state by key (possibly a lot of data) and then
compare it to the current event. Latency could be an issue then.

To be more specific with my issue:

My events have specific keys corresponding to some kind of user id. I want
to process only the last event for each user id, i.e. skip intermediate events by
user id. I have only one Kafka topic with all these events.
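
A minimal sketch of the within-batch part of this idea (the Event shape, the
keying by user id and the timestamp field are assumptions; it only collapses
events that arrive in the same micro-batch):

case class Event(userId: String, ts: Long, payload: String)

// keyedStream: DStream[(String, Event)], keyed by user id
val latestPerBatch = keyedStream.reduceByKey { (a, b) =>
  if (a.ts >= b.ts) a else b  // keep only the most recent event per user id
}

latestPerBatch.foreachRDD { rdd =>
  rdd.foreach { case (userId, event) =>
    // expensive processing + write the result back to Kafka here
  }
}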

Regards,

Julien Naour

Le mer. 6 janv. 2016 à 16:13, Cody Koeninger  a écrit :

> Have you read
>
> http://kafka.apache.org/documentation.html#compaction
>
>
>
> On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour  wrote:
>
>> Context: Process data coming from Kafka and send back results to Kafka.
>>
>> Issue: Each events could take several seconds to process (Work in
>> progress to improve that). During that time, events (and RDD) do
>> accumulate. Intermediate events (by key) do not have to be processed, only
>> the last ones. So when one process finished it would be ideal that Spark
>> Streaming skip all events that are not the current last ones (by key).
>>
>> I'm not sure that the solution could be done using only Spark Streaming
>> API. As I understand Spark Streaming, DStream RDD will accumulate and be
>> processed one by one and do not considerate if there are others afterwards.
>>
>> Possible solutions:
>>
>> Using only Spark Streaming API but I'm not sure how. updateStateByKey
>> seems to be a solution. But I'm not sure that it will work properly when
>> DStream RDD accumulate and you have to only process lasts events by key.
>>
>> Have two Spark Streaming pipelines. One to get last updated event by key,
>> store that in a map or a database. The second pipeline processes events
>> only if they are the last ones as indicate by the other pipeline.
>>
>>
>> Sub questions for the second solution:
>>
>> Could two pipelines share the same sparkStreamingContext and process the
>> same DStream at different speed (low processing vs high)?
>>
>> Is it easily possible to share values (map for example) between pipelines
>> without using an external database? I think accumulator/broadcast could
>> work but between two pipelines I'm not sure.
>>
>> Regards,
>>
>> Julien Naour
>>
>
>


Re: Spark on Apache Ingnite?

2016-01-06 Thread Ravi Kora
We have been using Ignite on Spark for one of our use cases. We are using
Ignite's SharedRDD feature. The following links should get you started in that
direction. We have been using it for the basic use case and it works fine so far.
There is not a whole lot of documentation on spark-ignite integration though.
Some pain points that we observed are that it gives serialization errors when
used on non-basic data types (UDTs etc.).

https://apacheignite.readme.io/docs/shared-rdd
https://apacheignite.readme.io/docs/testing-integration-with-spark-shell
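
A minimal sketch of the shared RDD usage those docs describe (the cache name
and types are assumptions; the API has moved around between Ignite releases,
so check the version you deploy):

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext

val igniteContext = new IgniteContext[String, Int](sc, () => new IgniteConfiguration())
val sharedRDD = igniteContext.fromCache("sharedData")  // assumed cache name

// Write pairs from one Spark job ...
sharedRDD.savePairs(sc.parallelize(1 to 1000).map(i => (i.toString, i)))

// ... and read/filter them from another job or app sharing the same Ignite cluster
val small = sharedRDD.filter(_._2 < 10).collect()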

-Ravi

From: Umesh Kacha >
Date: Tuesday, January 5, 2016 at 11:47 PM
To: "n...@reactor8.com" 
>
Cc: "user@spark.apache.org" 
>
Subject: RE: Spark on Apache Ingnite?


Hi Nate, thanks much. I have the exact same use cases you mentioned. My Spark
job does heavy writing involving group by and huge data shuffling. Can you
please provide any pointers on how I can take my existing Spark job, which is
running on YARN, and make it run on Ignite? Please guide. Thanks again.

On Jan 6, 2016 02:28, > wrote:
We started playing with Ignite backing Hadoop, Hive and Spark services, and are
looking to move to it as our default for deployment going forward. Still
early, but so far it's been pretty nice, and we're excited for the flexibility it
will provide for our particular use cases.

Would say in general its worth looking into if your data workloads are:

a) mix of read/write, or heavy write at times
b) want write/read access to data from services/apps outside of your spark
workloads (old Hadoop jobs, custom apps, etc)
c) have strings of spark jobs that could benefit from caching your data
across them (think similar usage to tachyon)
d) you have sparksql queries that could benefit from indexing and mutability
(see pt (a) about mix read/write)

If your data is read exclusive and very batch oriented, and your workloads
are strictly spark based, benefits will be less and ignite would probably
act as more of a tachyon replacement as many of the other features outside
of RDD caching wont be leveraged.


-Original Message-
From: unk1102 [mailto:umesh.ka...@gmail.com]
Sent: Tuesday, January 5, 2016 10:15 AM
To: user@spark.apache.org
Subject: Spark on Apache Ingnite?

Hi has anybody tried and had success with Spark on Apache Ignite seems
promising? https://ignite.apache.org/



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Apache-Ingnite-
tp25884.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: 
user-unsubscr...@spark.apache.org For 
additional
commands, e-mail: user-h...@spark.apache.org




fp growth - clean up repetitions in input

2016-01-06 Thread matd
Hi folks,

I'm interested in using FP growth to identify sequence patterns.

Unfortunately, my input sequences have cycles :
...1,2,4,1,2,5...

And this is not supported by fp-growth
(I get a SparkException: Items in a transaction must be unique but got
WrappedArray)

Do you know a way to identify and clean up cycles before giving them to
fp-growth?
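
One minimal clean-up sketch (assuming the transactions are an RDD[Array[Int]];
FPGrowth mines unordered itemsets, so repeated items can simply be dropped, and
spark.mllib's PrefixSpan is the one to look at if the sequence order itself matters):

import org.apache.spark.mllib.fpm.FPGrowth

// transactions: RDD[Array[Int]], e.g. Array(1, 2, 4, 1, 2, 5)
val deduped = transactions.map(_.distinct)

val model = new FPGrowth()
  .setMinSupport(0.2)     // assumed threshold
  .setNumPartitions(10)
  .run(deduped)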

thanks for your input.
Mat



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/fp-growth-clean-up-repetitions-in-input-tp25897.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread unk1102
Hi, as part of the Spark 1.6 release, what should be the ideal value or unit for
spark.memory.offheap.size? I have set it as 5000; I assume it will be 5 GB. Is that
correct? Please guide.
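
For reference, as of 1.6 the key is camel-cased (spark.memory.offHeap.size) and
the value is documented as an absolute number of bytes, so 5000 would mean 5000
bytes rather than 5 GB. A sketch for 5 GB (off-heap use also has to be enabled):

val conf = new org.apache.spark.SparkConf()
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", (5L * 1024 * 1024 * 1024).toString)  // 5368709120 bytes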



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: [Spark-SQL] Custom aggregate function for GrouppedData

2016-01-06 Thread Michael Armbrust
In Spark 1.6, GroupedDataset has mapGroups, which sounds like what you are
looking for. You can also write a custom Aggregator.
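
A rough sketch of the collect-set style aggregation from the question using the
1.6 Dataset API (the case class and column names are assumptions):

import sqlContext.implicits._

case class Sale(id: Int, name: String)

val concatenated = df.as[Sale]
  .groupBy(_.id)
  .mapGroups { (id, rows) =>
    // unique, ordered, comma-separated values per key
    (id, rows.map(_.name).toSeq.distinct.sorted.mkString(","))
  }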


On Tue, Jan 5, 2016 at 8:14 PM, Abhishek Gayakwad 
wrote:

> Hello Hivemind,
>
> Referring to this thread -
> https://forums.databricks.com/questions/956/how-do-i-group-my-dataset-by-a-key-or-combination.html.
> I have learnt that we cannot do much with grouped data apart from using
> existing aggregate functions. That post was written in May 2015; I
> don't know if things have changed since then. I am using Spark
> version 1.4.
>
> What I am trying to achieve is something very similar to collect_set in
> hive (actually unique, ordered, concatenated values), e.g.
>
> 1,2
> 1,3
> 2,4
> 2,5
> 2,4
>
> to
> 1, "2,3"
> 2, "4,5"
>
> Currently I am achieving this by converting the dataframe to an RDD, doing the
> required operations and converting it back to a dataframe, as shown below.
>
> public class AvailableSizes implements Serializable {
>
> public DataFrame calculate(SQLContext ssc, DataFrame salesDataFrame) {
> final JavaRDD rowJavaRDD = salesDataFrame.toJavaRDD();
>
> JavaPairRDD pairs = rowJavaRDD.mapToPair(
> (PairFunction) row -> {
> final Object[] objects = {row.getAs(0), row.getAs(1), 
> row.getAs(3)};
> return new Tuple2<>(row.getAs(SalesColumns.STYLE.name()), 
> new GenericRowWithSchema(objects, SalesColumns.getOutputSchema()));
> });
>
> JavaPairRDD withSizeList = pairs.reduceByKey(new 
> Function2() {
> @Override
> public Row call(Row aRow, Row bRow) {
> final String uniqueCommaSeparatedSizes = uniqueSizes(aRow, 
> bRow);
> final Object[] objects = {aRow.getAs(0), aRow.getAs(1), 
> uniqueCommaSeparatedSizes};
> return new GenericRowWithSchema(objects, 
> SalesColumns.getOutputSchema());
> }
>
> private String uniqueSizes(Row aRow, Row bRow) {
> final SortedSet allSizes = new TreeSet<>();
> final List aSizes = Arrays.asList(((String) 
> aRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
> final List bSizes = Arrays.asList(((String) 
> bRow.getAs(String.valueOf(SalesColumns.SIZE))).split(","));
> allSizes.addAll(aSizes);
> allSizes.addAll(bSizes);
> return csvFormat(allSizes);
> }
> });
>
> final JavaRDD values = withSizeList.values();
>
> return ssc.createDataFrame(values, SalesColumns.getOutputSchema());
>
> }
>
> public String csvFormat(Collection collection) {
> return 
> collection.stream().map(Object::toString).collect(Collectors.joining(","));
> }
> }
>
> Please suggest if there is a better way of doing this.
>
> Regards,
> Abhishek
>


Why is this job running since one hour?

2016-01-06 Thread unk1102
Hi, I have one main Spark job which spawns multiple child Spark jobs. One of
the child Spark jobs has been running for an hour and it keeps hanging there.
I have taken a snapshot, please see:

[image: snapshot]



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-this-job-running-since-one-hour-tp25899.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem building spark on centos

2016-01-06 Thread Ted Yu
w.r.t. the second error, have you read this ?
http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo

On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu  wrote:

> I’m using 3.3.9. Thanks!
>
> Jade
>
> From: Ted Yu 
> Date: Tuesday, January 5, 2016 at 4:57 PM
> To: Jade Liu 
> Cc: "user@spark.apache.org" 
> Subject: Re: problem building spark on centos
>
> Which version of maven are you using ?
>
> It should be 3.3.3+
>
> On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu  wrote:
>
>> Hi, All:
>>
>> I’m trying to build spark 1.5.2 from source using maven with the
>> following command:
>>
>> ./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
>> -Dscala-2.11 -Phive -Phive-thriftserver -DskipTests
>>
>>
>> I got the following error:
>>
>> + VERSION='[ERROR] [Help 2]
>> http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException
>> '
>>
>>
>> When I try:
>>
>> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
>> package
>>
>>
>> I got the following error:
>>
>> [FATAL] Non-resolvable parent POM for
>> org.apache.spark:spark-parent_2.11:1.5.2: Could not transfer artifact
>> org.apache:apache:pom:14 from/to central (https://repo1.maven.org/maven2):
>> java.security.ProviderException: java.security.KeyException and
>> 'parent.relativePath' points at wrong local POM @ line 22, column 11
>>
>>
>> Does anyone know how to change the settings in maven to fix this?
>>
>>
>> Thanks in advance,
>>
>>
>> Jade
>>
>
>


Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael,

I really appreciate your help. The following code works. Is there a way
this example can be added to the distribution to make it easier for future
Java programmers? It took me a long time to get to this simple solution.

I'll need to tweak this example a little to work with the new Pipeline save
functionality. We need the current sqlContext to register our UDF. I'll see if
I can pass this in the Param Map. I'll throw an exception if someone uses
transform(df).

public class StemmerTransformer extends Transformer implements Serializable {

    void registerUDF() {
        if (udf == null) {
            udf = new UDF();
            DataType returnType = DataTypes.createArrayType(DataTypes.StringType);
            sqlContext.udf().register(udfName, udf, returnType);
        }
    }

    @Override
    public DataFrame transform(DataFrame df) {
        df.printSchema();
        df.show();

        registerUDF();

        DataFrame ret = df.selectExpr("*", "StemUDF(rawInput) as filteredOutput");
        return ret;
    }

    class UDF implements UDF1<WrappedArray<String>, List<String>> {
        private static final long serialVersionUID = 1L;

        @Override
        public List<String> call(WrappedArray<String> wordsArg) throws Exception {
            List<String> words = JavaConversions.asJavaList(wordsArg);
            ArrayList<String> ret = new ArrayList<>(words.size());
            for (String word : words) {
                // TODO replace test code
                ret.add(word + "_stemed");
            }

            return ret;
        }
    }
}



root
 |-- rawInput: array (nullable = false)
 |    |-- element: string (containsNull = true)

+--------------------+
|            rawInput|
+--------------------+
|[I, saw, the, red...|
|[Mary, had, a, li...|
|[greet, greeting,...|
+--------------------+

root
 |-- rawInput: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- filteredOutput: array (nullable = true)
 |    |-- element: string (containsNull = true)

+----------------------------------+---------------------------------------------------------------+
|rawInput                          |filteredOutput                                                 |
+----------------------------------+---------------------------------------------------------------+
|[I, saw, the, red, baloon]        |[I_stemed, saw_stemed, the_stemed, red_stemed, baloon_stemed]  |
|[Mary, had, a, little, lamb]      |[Mary_stemed, had_stemed, a_stemed, little_stemed, lamb_stemed]|
|[greet, greeting, greets, greeted]|[greet_stemed, greeting_stemed, greets_stemed, greeted_stemed] |
+----------------------------------+---------------------------------------------------------------+



From:  Michael Armbrust 
Date:  Tuesday, January 5, 2016 at 12:58 PM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: problem with DataFrame df.withColumn()
org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

>> I am trying to implement the org.apache.spark.ml.Transformer interface in Java 8.
>> My understanding is that the pseudo code for transformers is something like
>> @Override
>> 
>> public DataFrame transform(DataFrame df) {
>> 
>> 1. Select the input column
>> 
>> 2. Create a new column
>> 
>> 3. Append the new column to the df argument and return
>> 
>>}
> 
> 
> The following line can be used inside of the transform function to return a
> Dataframe that has been augmented with a new column using the stem lambda
> function (defined as a UDF below).
> return df.withColumn("filteredInput", expr("stem(rawInput)"));
> This is producing a new column called filterInput (that is appended to
> whatever columns are already there) by passing the column rawInput to your
> arbitrary lambda function.
>  
>> Based on my experience the current DataFrame api is very limited. You can not
>> apply a complicated lambda function. As a work around I convert the data
>> frame to a JavaRDD, apply my complicated lambda, and then convert the
>> resulting RDD back to a Data Frame.
> 
> 
> This is exactly what this code is doing.  You are defining an arbitrary lambda
> function as a UDF.  The difference here, when compared to a JavaRDD map, is
> that you can use this UDF to append columns without having to manually append
> the new data to some existing object.
> sqlContext.udf().register("stem", new UDF1() {
>   @Override
>   public String call(String str) {
> return // TODO: stemming code here
>   }
> }, DataTypes.StringType);
>> Now I select the "new column" from the Data Frame and try to call
>> df.withColumn().
>> 
>> 
>> 
>> I can try an implement this as a UDF. How ever I need to use several 3rd
>> party jars. Any idea how insure the workers will have the required jar files?
>> If I was submitting a normal java app I would create an uber jar will this
>> work with UDFs?
> 
> 
> Yeah, UDFs are run the same way 

Re: How to insert df in HBASE

2016-01-06 Thread Ted Yu
Cycling prior discussion:

http://search-hadoop.com/m/q3RTtX7POh17hqdj1
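
For a rough idea of the shape of one common approach (writing from the
executors with the plain HBase 1.x client API; the table name, column family
and column mappings below are assumptions, not a library-provided connector):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

df.foreachPartition { rows =>
  // one connection per partition, created on the executor
  val conf = HBaseConfiguration.create()  // picks up hbase-site.xml from the classpath
  val conn = ConnectionFactory.createConnection(conf)
  val table = conn.getTable(TableName.valueOf("my_table"))
  try {
    rows.foreach { row =>
      val put = new Put(Bytes.toBytes(row.getAs[String]("id")))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("name"),
        Bytes.toBytes(row.getAs[String]("name")))
      table.put(put)
    }
  } finally {
    table.close()
    conn.close()
  }
}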

On Wed, Jan 6, 2016 at 3:07 AM, Sadaf  wrote:

> HI,
>
> I need to insert a Dataframe in to hbase using scala code.
> Can anyone guide me how to achieve this?
>
> Any help would be much appreciated.
> Thanks
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-insert-df-in-HBASE-tp25891.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Predictive Modelling in sparkR

2016-01-06 Thread Chandan Verma
Has anyone tried building a logistic regression model in SparkR? Is it
recommended? Does it take longer to process than what can be done in
plain R?

 







Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
>
> I really appreciate your help. The following code works.
>

Glad you got it to work!

> Is there a way this example can be added to the distribution to make it
> easier for future Java programmers? It took me a long time to get to this
> simple solution.
>

I'd welcome a pull request that added UDFs to the programming guide section
on dataframes:
http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations


Re: problem building spark on centos

2016-01-06 Thread Jade Liu
I’ve changed the scala version to 2.10.

With this command:
build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean 
package
Build was successful.

But making a runnable version:
./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 -Phive
-Phive-thriftserver -DskipTests
Still fails with the following error:
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
Caused by: Compile failed via zinc server
at sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)

Not sure what’s causing it. Does anyone have any idea?

Thanks!

Jade
From: Ted Yu >
Date: Wednesday, January 6, 2016 at 10:40 AM
To: Jade Liu >, user 
>
Subject: Re: problem building spark on centos

w.r.t. the second error, have you read this ?
http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo

On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu 
> wrote:
I’m using 3.3.9. Thanks!

Jade

From: Ted Yu
Date: Tuesday, January 5, 2016 at 4:57 PM
To: Jade Liu
Cc: "user@spark.apache.org"
Subject: Re: problem building spark on centos

Which version of maven are you using ?

It should be 3.3.3+

On Tue, Jan 5, 2016 at 4:54 PM, Jade Liu wrote:
Hi, All:

I’m trying to build spark 1.5.2 from source using maven with the following 
command:
./make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0 
-Dscala-2.11 -Phive -Phive-thriftserver –DskipTests

I got the following error:
+ VERSION='[ERROR] [Help 2] 
http://cwiki.apache.org/confluence/display/MAVEN/UnresolvableModelException'

When I try:

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Wei Chen
Thank you. I have tried the window function as follows:

import pyspark.sql.functions as f
sqc = sqlContext
from pyspark.sql import Window
import pandas as pd

DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
                   'b': [1,2,3,1,2,3,1,2,3],
                   'c': [1,2,3,4,5,6,7,8,9]
                  })

df = sqc.createDataFrame(DF)

window = Window.partitionBy("a").orderBy("c")

df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()

I got the following result which is understandable:

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  1|
|  1|  3|  3|  1|
|  2|  1|  4|  4|
|  2|  2|  5|  4|
|  2|  3|  6|  4|
|  3|  1|  7|  7|
|  3|  2|  8|  7|
|  3|  3|  9|  7|
+---+---+---+---+


However, if I change min to max, the result is not what I expected:

df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives

+---+---+---+---+
|  a|  b|  c|  y|
+---+---+---+---+
|  1|  1|  1|  1|
|  1|  2|  2|  2|
|  1|  3|  3|  3|
|  2|  1|  4|  4|
|  2|  2|  5|  5|
|  2|  3|  6|  6|
|  3|  1|  7|  7|
|  3|  2|  8|  8|
|  3|  3|  9|  9|
+---+---+---+---+
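
If I understand the window semantics correctly, once an orderBy is specified the
default frame runs from the start of the partition up to the current row, so max
over that frame is a running max, which would explain the output above. A minimal
sketch of two possible fixes (my guess at the API for this PySpark version, worth
double-checking):

import pyspark.sql.functions as f
from pyspark.sql import Window

# 1) Per-group max: drop the orderBy so the frame covers the whole partition
whole_group = Window.partitionBy("a")
df.select('a', 'b', 'c', f.max('c').over(whole_group).alias('y')).show()

# 2) Row with the smallest 'b' per group (the original question): rank within
#    each group by 'b' and keep rank 1 (rank keeps ties), as in the SQL below
ranked = Window.partitionBy("a").orderBy("b")
df.withColumn("r", f.rank().over(ranked)).where("r = 1").drop("r").show()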



Thanks,

Wei


On Tue, Jan 5, 2016 at 8:30 PM, ayan guha  wrote:

> Yes there is. It is called window function over partitions.
>
> Equivalent SQL would be:
>
> select * from
>  (select a,b,c, rank() over (partition by a order by b) r from df)
> x
> where r = 1
>
> You can register your DF as a temp table and use the sql form. Or, (>Spark
> 1.4) you can use window methods and their variants in Spark SQL module.
>
> HTH
>
> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen 
> wrote:
>
>> Hi,
>>
>> I am trying to retrieve the rows with a minimum value of a column for
>> each group. For example: the following dataframe:
>>
>> a | b | c
>> --
>> 1 | 1 | 1
>> 1 | 2 | 2
>> 1 | 3 | 3
>> 2 | 1 | 4
>> 2 | 2 | 5
>> 2 | 3 | 6
>> 3 | 1 | 7
>> 3 | 2 | 8
>> 3 | 3 | 9
>> --
>>
>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>> to return the following dataframe:
>>
>> a | b | c
>> --
>> 1 | 1 | 1
>> 2 | 1 | 4
>> 3 | 1 | 7
>> --
>>
>> The dataframe I have is huge, so getting the minimum value of b for each
>> group and then joining back on the original dataframe is very expensive. Is
>> there a better way to do this?
>>
>>
>> Thanks,
>> Wei
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Wei Chen, Ph.D.
Astronomer and Data Scientist
Phone: (832)646-7124
Email: wei.chen.ri...@gmail.com
LinkedIn: https://www.linkedin.com/in/weichen1984


Re: spark 1.6 Issue

2016-01-06 Thread Mark Hamstra
It's not a bug, but a larger heap is required with the new
UnifiedMemoryManager:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172

On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com <
kali.tumm...@gmail.com> wrote:

> Hi All,
>
> I am running my app in IntelliJ Idea (locally) my config local[*] , the
> code
> worked ok with spark 1.5 but when I upgraded to 1.6 I am having below
> issue.
>
> is this a bug in 1.6 ? I change back to 1.5 it worked ok without any error
> do I need to pass executor memory while running in local in spark 1.6 ?
>
> Exception in thread "main" java.lang.IllegalArgumentException: System
> memory
> 259522560 must be at least 4.718592E8. Please use a larger heap size.
>
> Thanks
> Sri
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-Issue-tp25893.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark 1.6 Issue

2016-01-06 Thread Sri
Hi Mark,

I changed the VM options in the Edit Configurations section for both the main method
and the Scala test case class in IntelliJ, which worked fine when I executed them
individually, but when running maven install to create the jar file the test case
fails.
Can I add the VM options as a hard-coded SparkConf setting in the Scala test class?
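Something like the following is what I have in mind (a rough sketch only; I am
assuming the 1.6 check can also be satisfied via the spark.testing.memory property
that UnifiedMemoryManager reads, so treat that property name as something to verify,
or simply raise the JVM heap with -Xmx instead):

import org.apache.spark.{SparkConf, SparkContext}

// Local test context with enough memory for the 1.6 UnifiedMemoryManager check
// (the error above asks for at least ~4.7e8 bytes)
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("unit-test")
  .set("spark.testing.memory", (512L * 1024 * 1024).toString)  // 512 MB, in bytes

val sc = new SparkContext(conf)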

Thanks
Sri

Sent from my iPhone

> On 6 Jan 2016, at 17:43, Mark Hamstra  wrote:
> 
> It's not a bug, but a larger heap is required with the new 
> UnifiedMemoryManager: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/memory/UnifiedMemoryManager.scala#L172
> 
>> On Wed, Jan 6, 2016 at 6:35 AM, kali.tumm...@gmail.com 
>>  wrote:
>> Hi All,
>> 
>> I am running my app in IntelliJ Idea (locally) my config local[*] , the code
>> worked ok with spark 1.5 but when I upgraded to 1.6 I am having below issue.
>> 
>> is this a bug in 1.6 ? I change back to 1.5 it worked ok without any error
>> do I need to pass executor memory while running in local in spark 1.6 ?
>> 
>> Exception in thread "main" java.lang.IllegalArgumentException: System memory
>> 259522560 must be at least 4.718592E8. Please use a larger heap size.
>> 
>> Thanks
>> Sri
>> 
>> 
>> 
>> --
>> View this message in context: 
>> http://apache-spark-user-list.1001560.n3.nabble.com/spark-1-6-Issue-tp25893.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
> 


When to use streaming state and when an external storage?

2016-01-06 Thread Rado Buranský
What are the pros/cons and the general idea behind state in Spark Streaming? By
state I mean state created by "mapWithState" (or updateStateByKey).

When should it be used and when not? Is it a good idea to accumulate state in
jobs that run continuously for years?

Example: remember the IP addresses of returning visitors. The key is an IP address
and state is a boolean set to true if we have seen the same IP before.
Let's start the job now and let it run forever.
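
To make the example concrete, the kind of state I have in mind is roughly this
(just a sketch with updateStateByKey, assuming a DStream[String] of visitor IPs
called ipAddresses):

// ipAddresses: DStream[String] of visitor IPs (hypothetical input stream)
val returning = ipAddresses.map(ip => (ip, true)).updateStateByKey[Boolean] {
  (newValues: Seq[Boolean], state: Option[Boolean]) =>
    // once an IP has been seen, remember it forever
    Some(state.getOrElse(false) || newValues.nonEmpty)
}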

What happens to the state if we stop and then start the app? When can we
lose the state and never be able to recover it?

Too many questions, I know.

Thanks

Rado


Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
if you don't have hot users, you can use the user id as the hash key for
publishing into kafka.
That will put all events for a given user in the same partition per batch.
Then you can do foreachPartition with a local map to store just a single
event per user, e.g.

foreachPartition { p =>
  val m = new HashMap()
  p.foreach { event =>
    m.put(event.user, event)  // later events overwrite earlier ones, keeping only the last per user
  }
  m.foreach { case (user, event) =>
    // ... do your computation
  }
}

On Wed, Jan 6, 2016 at 10:09 AM, Julien Naour  wrote:

> Thanks for your answer,
>
> As I understand it, a consumer that stays caught-up will read every
> message even with compaction. So for a pure Kafka Spark Streaming It will
> not be a solution.
>
> Perhaps I could reconnect to the Kafka topic after each process to get the
> last state of events and then compare to current Kafka Spark Streaming
> events, but it seems a little tricky. For each event it will connect to
> Kafka and get the current state by key (possibly lot of data) and then
> compare to the current event. Latency could be an issue then.
>
> To be more specific with my issue:
>
> My events have specific keys corresponding to some kind of user id. I want
> to process last events by each user id once ie skip intermediate events by
> user id.
> I have only one Kafka topic with all theses events.
>
> Regards,
>
> Julien Naour
>
> Le mer. 6 janv. 2016 à 16:13, Cody Koeninger  a
> écrit :
>
>> Have you read
>>
>> http://kafka.apache.org/documentation.html#compaction
>>
>>
>>
>> On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour  wrote:
>>
>>> Context: Process data coming from Kafka and send back results to Kafka.
>>>
>>> Issue: Each events could take several seconds to process (Work in
>>> progress to improve that). During that time, events (and RDD) do
>>> accumulate. Intermediate events (by key) do not have to be processed, only
>>> the last ones. So when one process finished it would be ideal that Spark
>>> Streaming skip all events that are not the current last ones (by key).
>>>
>>> I'm not sure that the solution could be done using only Spark Streaming
>>> API. As I understand Spark Streaming, DStream RDD will accumulate and be
>>> processed one by one and do not considerate if there are others afterwards.
>>>
>>> Possible solutions:
>>>
>>> Using only Spark Streaming API but I'm not sure how. updateStateByKey
>>> seems to be a solution. But I'm not sure that it will work properly when
>>> DStream RDD accumulate and you have to only process lasts events by key.
>>>
>>> Have two Spark Streaming pipelines. One to get last updated event by
>>> key, store that in a map or a database. The second pipeline processes
>>> events only if they are the last ones as indicate by the other pipeline.
>>>
>>>
>>> Sub questions for the second solution:
>>>
>>> Could two pipelines share the same sparkStreamingContext and process the
>>> same DStream at different speed (low processing vs high)?
>>>
>>> Is it easily possible to share values (map for example) between
>>> pipelines without using an external database? I think accumulator/broadcast
>>> could work but between two pipelines I'm not sure.
>>>
>>> Regards,
>>>
>>> Julien Naour
>>>
>>
>>


Re: sparkR ORC support.

2016-01-06 Thread Felix Cheung
Yes, as Yanbo suggested, it looks like there is something wrong with the 
sqlContext.
Could you forward us your code please?

On Wed, Jan 6, 2016 at 5:52 AM -0800, "Yanbo Liang"  wrote:

You should ensure your sqlContext is HiveContext.

sc <- sparkR.init()

sqlContext <- sparkRHive.init(sc)


2016-01-06 20:35 GMT+08:00 Sandeep Khurana :

> Felix
>
> I tried the option suggested by you.  It gave below error.  I am going to
> try the option suggested by Prem .
>
> Error in writeJobj(con, object) : invalid jobj 1
> 8
> stop("invalid jobj ", value$id)
> 7
> writeJobj(con, object)
> 6
> writeObject(con, a)
> 5
> writeArgs(rc, args)
> 4
> invokeJava(isStatic = TRUE, className, methodName, ...)
> 3
> callJStatic("org.apache.spark.sql.api.r.SQLUtils", "loadDF", sqlContext,
> source, options)
> 2
> read.df(sqlContext, filepath, "orc") at
> spark_api.R#108
>
> On Wed, Jan 6, 2016 at 10:30 AM, Felix Cheung 
> wrote:
>
>> Firstly I don't have ORC data to verify but this should work:
>>
>> df <- loadDF(sqlContext, "data/path", "orc")
>>
>> Secondly, could you check if sparkR.stop() was called? sparkRHive.init()
>> should be called after sparkR.init() - please check if there is any error
>> message there.
>>
>> _
>> From: Prem Sure 
>> Sent: Tuesday, January 5, 2016 8:12 AM
>> Subject: Re: sparkR ORC support.
>> To: Sandeep Khurana 
>> Cc: spark users , Deepak Sharma <
>> deepakmc...@gmail.com>
>>
>>
>>
>> Yes Sandeep, also copy hive-site.xml too to spark conf directory.
>>
>>
>> On Tue, Jan 5, 2016 at 10:07 AM, Sandeep Khurana 
>> wrote:
>>
>>> Also, do I need to setup hive in spark as per the link
>>> http://stackoverflow.com/questions/26360725/accesing-hive-tables-in-spark
>>> ?
>>>
>>> We might need to copy hdfs-site.xml file to spark conf directory ?
>>>
>>> On Tue, Jan 5, 2016 at 8:28 PM, Sandeep Khurana 
>>> wrote:
>>>
 Deepak

 Tried this. Getting this error now

 Error in sql(hivecontext, "FROM CATEGORIES SELECT category_id", "") :
 unused argument ("")


 On Tue, Jan 5, 2016 at 6:48 PM, Deepak Sharma 
 wrote:

> Hi Sandeep
> can you try this ?
>
> results <- sql(hivecontext, "FROM test SELECT id","")
>
> Thanks
> Deepak
>
>
> On Tue, Jan 5, 2016 at 5:49 PM, Sandeep Khurana 
> wrote:
>
>> Thanks Deepak.
>>
>> I tried this as well. I created a hivecontext   with  "hivecontext
>> <<- sparkRHive.init(sc) "  .
>>
>> When I tried to read hive table from this ,
>>
>> results <- sql(hivecontext, "FROM test SELECT id")
>>
>> I get below error,
>>
>> Error in callJMethod(sqlContext, "sql", sqlQuery) :   Invalid jobj 2. If 
>> SparkR was restarted, Spark operations need to be re-executed.
>>
>>
>> Not sure what is causing this? Any leads or ideas? I am using rstudio.
>>
>>
>>
>> On Tue, Jan 5, 2016 at 5:35 PM, Deepak Sharma 
>> wrote:
>>
>>> Hi Sandeep
>>> I am not sure if ORC can be read directly in R.
>>> But there can be a workaround .First create hive table on top of ORC
>>> files and then access hive table in R.
>>>
>>> Thanks
>>> Deepak
>>>
>>> On Tue, Jan 5, 2016 at 4:57 PM, Sandeep Khurana <
>>> sand...@infoworks.io> wrote:
>>>
 Hello

 I need to read an ORC files in hdfs in R using spark. I am not able
 to find a package to do that.

 Can anyone help with documentation or example for this purpose?

 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Thanks
>>> Deepak
>>> www.bigdatabig.com
>>> www.keosha.net
>>>
>>
>>
>>
>> --
>> Architect
>> Infoworks.io 
>> http://Infoworks.io
>>
>
>
>
> --
> Thanks
> Deepak
> www.bigdatabig.com
> www.keosha.net
>



 --
 Architect
 Infoworks.io 
 http://Infoworks.io

>>>
>>>
>>>
>>> --
>>> Architect
>>> Infoworks.io 
>>> http://Infoworks.io
>>>
>>
>>
>>
>>
>
>
> --
> Architect
> Infoworks.io
> http://Infoworks.io
>


Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
Thanks Cody again for your answer.

The idea here is to process all events but only launch the big job (which takes
longer than the batch interval) if they are the last events for an id, given the
current state of the data. Knowing whether they are the last is in fact my issue.

So I think I need two jobs: one to get a view of these last events by id, and
one that compares the events in the DStream to that view of last events and
launches the big job only if needed.

It is something of a trick to avoid accumulating latency. I know the main issue
comes from the fact that my job is too long, but it would be useful to have this
workaround to buy time to handle the main issue, which will be more time
consuming (changes in architecture are probably needed).

Le mer. 6 janv. 2016 à 18:06, Cody Koeninger  a écrit :

> If your job consistently takes longer than the batch time to process, you
> will keep lagging longer and longer behind.  That's not sustainable, you
> need to increase batch sizes or decrease processing time.  In your case,
> probably increase batch size, since you're pre-filtering it down to only 1
> event per user.
>
> There's no way around this, you can't just magically ignore some time
> range of rdds, because they may contain events you care about.
>
> On Wed, Jan 6, 2016 at 10:55 AM, Julien Naour  wrote:
>
>> The following lines are my understanding of Spark Streaming AFAIK, I
>> could be wrong:
>>
>> Spark Streaming processes data from a Stream in micro-batch, one at a
>> time.
>> When a process takes time, DStream's RDD are accumulated.
>> So in my case (my process takes time) DStream's RDD are accumulated. What
>> I want to do is to skip all intermediate events in this queue of RDD.
>>
>> I'm not sure to be right but with your solution I'll get the last events
>> by id for each RDD in the DStream. What I want is to process the following
>> event that is the last events for an id for the DStream's RDD that are
>> accumulated.
>>
>> Regards,
>>
>> Julien
>>
>> Le mer. 6 janv. 2016 à 17:35, Cody Koeninger  a
>> écrit :
>>
>>> if you don't have hot users, you can use the user id as the hash key for
>>> publishing into kafka.
>>> That will put all events for a given user in the same partition per
>>> batch.
>>> Then you can do foreachPartition with a local map to store just a single
>>> event per user, e.g.
>>>
>>> foreachPartition { p =>
>>>   val m = new HashMap
>>>   p.foreach ( event =>
>>> m.put(event,user, event)
>>>   }
>>>   m.foreach {
>>>... do your computation
>>>  }
>>>
>>> On Wed, Jan 6, 2016 at 10:09 AM, Julien Naour 
>>> wrote:
>>>
 Thanks for your answer,

 As I understand it, a consumer that stays caught-up will read every
 message even with compaction. So for a pure Kafka Spark Streaming It will
 not be a solution.

 Perhaps I could reconnect to the Kafka topic after each process to get
 the last state of events and then compare to current Kafka Spark Streaming
 events, but it seems a little tricky. For each event it will connect to
 Kafka and get the current state by key (possibly lot of data) and then
 compare to the current event. Latency could be an issue then.

 To be more specific with my issue:

 My events have specific keys corresponding to some kind of user id. I
 want to process last events by each user id once ie skip intermediate
 events by user id.
 I have only one Kafka topic with all theses events.

 Regards,

 Julien Naour

 Le mer. 6 janv. 2016 à 16:13, Cody Koeninger  a
 écrit :

> Have you read
>
> http://kafka.apache.org/documentation.html#compaction
>
>
>
> On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour 
> wrote:
>
>> Context: Process data coming from Kafka and send back results to
>> Kafka.
>>
>> Issue: Each events could take several seconds to process (Work in
>> progress to improve that). During that time, events (and RDD) do
>> accumulate. Intermediate events (by key) do not have to be processed, 
>> only
>> the last ones. So when one process finished it would be ideal that Spark
>> Streaming skip all events that are not the current last ones (by key).
>>
>> I'm not sure that the solution could be done using only Spark
>> Streaming API. As I understand Spark Streaming, DStream RDD will 
>> accumulate
>> and be processed one by one and do not considerate if there are others
>> afterwards.
>>
>> Possible solutions:
>>
>> Using only Spark Streaming API but I'm not sure how. updateStateByKey
>> seems to be a solution. But I'm not sure that it will work properly when
>> DStream RDD accumulate and you have to only process lasts events by key.
>>
>> Have two Spark Streaming pipelines. One to get last updated event by
>> key, 

Re: Spark Streaming: process only last events

2016-01-06 Thread Julien Naour
The following lines are my understanding of Spark Streaming, AFAIK; I could
be wrong:

Spark Streaming processes data from a stream in micro-batches, one at a time.
When processing takes too long, the DStream's RDDs accumulate.
So in my case (my processing takes time) the DStream's RDDs accumulate. What I
want to do is skip all intermediate events in this queue of RDDs.

I may be wrong, but with your solution I'll get the last events by id for each
individual RDD in the DStream. What I want is to process only the events that
are the last ones for an id across all of the DStream's RDDs that have
accumulated.

Regards,

Julien

Le mer. 6 janv. 2016 à 17:35, Cody Koeninger  a écrit :

> if you don't have hot users, you can use the user id as the hash key for
> publishing into kafka.
> That will put all events for a given user in the same partition per batch.
> Then you can do foreachPartition with a local map to store just a single
> event per user, e.g.
>
> foreachPartition { p =>
>   val m = new HashMap
>   p.foreach ( event =>
> m.put(event,user, event)
>   }
>   m.foreach {
>... do your computation
>  }
>
> On Wed, Jan 6, 2016 at 10:09 AM, Julien Naour  wrote:
>
>> Thanks for your answer,
>>
>> As I understand it, a consumer that stays caught-up will read every
>> message even with compaction. So for a pure Kafka Spark Streaming It will
>> not be a solution.
>>
>> Perhaps I could reconnect to the Kafka topic after each process to get
>> the last state of events and then compare to current Kafka Spark Streaming
>> events, but it seems a little tricky. For each event it will connect to
>> Kafka and get the current state by key (possibly lot of data) and then
>> compare to the current event. Latency could be an issue then.
>>
>> To be more specific with my issue:
>>
>> My events have specific keys corresponding to some kind of user id. I
>> want to process last events by each user id once ie skip intermediate
>> events by user id.
>> I have only one Kafka topic with all theses events.
>>
>> Regards,
>>
>> Julien Naour
>>
>> Le mer. 6 janv. 2016 à 16:13, Cody Koeninger  a
>> écrit :
>>
>>> Have you read
>>>
>>> http://kafka.apache.org/documentation.html#compaction
>>>
>>>
>>>
>>> On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour  wrote:
>>>
 Context: Process data coming from Kafka and send back results to Kafka.

 Issue: Each events could take several seconds to process (Work in
 progress to improve that). During that time, events (and RDD) do
 accumulate. Intermediate events (by key) do not have to be processed, only
 the last ones. So when one process finished it would be ideal that Spark
 Streaming skip all events that are not the current last ones (by key).

 I'm not sure that the solution could be done using only Spark Streaming
 API. As I understand Spark Streaming, DStream RDD will accumulate and be
 processed one by one and do not considerate if there are others afterwards.

 Possible solutions:

 Using only Spark Streaming API but I'm not sure how. updateStateByKey
 seems to be a solution. But I'm not sure that it will work properly when
 DStream RDD accumulate and you have to only process lasts events by key.

 Have two Spark Streaming pipelines. One to get last updated event by
 key, store that in a map or a database. The second pipeline processes
 events only if they are the last ones as indicate by the other pipeline.


 Sub questions for the second solution:

 Could two pipelines share the same sparkStreamingContext and process
 the same DStream at different speed (low processing vs high)?

 Is it easily possible to share values (map for example) between
 pipelines without using an external database? I think accumulator/broadcast
 could work but between two pipelines I'm not sure.

 Regards,

 Julien Naour

>>>
>>>
>


Re: Spark Streaming: process only last events

2016-01-06 Thread Cody Koeninger
If your job consistently takes longer than the batch time to process, you
will keep lagging longer and longer behind.  That's not sustainable, you
need to increase batch sizes or decrease processing time.  In your case,
probably increase batch size, since you're pre-filtering it down to only 1
event per user.

There's no way around this, you can't just magically ignore some time range
of rdds, because they may contain events you care about.

On Wed, Jan 6, 2016 at 10:55 AM, Julien Naour  wrote:

> The following lines are my understanding of Spark Streaming AFAIK, I could
> be wrong:
>
> Spark Streaming processes data from a Stream in micro-batch, one at a time.
> When a process takes time, DStream's RDD are accumulated.
> So in my case (my process takes time) DStream's RDD are accumulated. What
> I want to do is to skip all intermediate events in this queue of RDD.
>
> I'm not sure to be right but with your solution I'll get the last events
> by id for each RDD in the DStream. What I want is to process the following
> event that is the last events for an id for the DStream's RDD that are
> accumulated.
>
> Regards,
>
> Julien
>
> Le mer. 6 janv. 2016 à 17:35, Cody Koeninger  a
> écrit :
>
>> if you don't have hot users, you can use the user id as the hash key for
>> publishing into kafka.
>> That will put all events for a given user in the same partition per batch.
>> Then you can do foreachPartition with a local map to store just a single
>> event per user, e.g.
>>
>> foreachPartition { p =>
>>   val m = new HashMap
>>   p.foreach ( event =>
>> m.put(event,user, event)
>>   }
>>   m.foreach {
>>... do your computation
>>  }
>>
>> On Wed, Jan 6, 2016 at 10:09 AM, Julien Naour  wrote:
>>
>>> Thanks for your answer,
>>>
>>> As I understand it, a consumer that stays caught-up will read every
>>> message even with compaction. So for a pure Kafka Spark Streaming It will
>>> not be a solution.
>>>
>>> Perhaps I could reconnect to the Kafka topic after each process to get
>>> the last state of events and then compare to current Kafka Spark Streaming
>>> events, but it seems a little tricky. For each event it will connect to
>>> Kafka and get the current state by key (possibly lot of data) and then
>>> compare to the current event. Latency could be an issue then.
>>>
>>> To be more specific with my issue:
>>>
>>> My events have specific keys corresponding to some kind of user id. I
>>> want to process last events by each user id once ie skip intermediate
>>> events by user id.
>>> I have only one Kafka topic with all theses events.
>>>
>>> Regards,
>>>
>>> Julien Naour
>>>
>>> Le mer. 6 janv. 2016 à 16:13, Cody Koeninger  a
>>> écrit :
>>>
 Have you read

 http://kafka.apache.org/documentation.html#compaction



 On Wed, Jan 6, 2016 at 8:52 AM, Julien Naour 
 wrote:

> Context: Process data coming from Kafka and send back results to Kafka.
>
> Issue: Each events could take several seconds to process (Work in
> progress to improve that). During that time, events (and RDD) do
> accumulate. Intermediate events (by key) do not have to be processed, only
> the last ones. So when one process finished it would be ideal that Spark
> Streaming skip all events that are not the current last ones (by key).
>
> I'm not sure that the solution could be done using only Spark
> Streaming API. As I understand Spark Streaming, DStream RDD will 
> accumulate
> and be processed one by one and do not considerate if there are others
> afterwards.
>
> Possible solutions:
>
> Using only Spark Streaming API but I'm not sure how. updateStateByKey
> seems to be a solution. But I'm not sure that it will work properly when
> DStream RDD accumulate and you have to only process lasts events by key.
>
> Have two Spark Streaming pipelines. One to get last updated event by
> key, store that in a map or a database. The second pipeline processes
> events only if they are the last ones as indicate by the other pipeline.
>
>
> Sub questions for the second solution:
>
> Could two pipelines share the same sparkStreamingContext and process
> the same DStream at different speed (low processing vs high)?
>
> Is it easily possible to share values (map for example) between
> pipelines without using an external database? I think 
> accumulator/broadcast
> could work but between two pipelines I'm not sure.
>
> Regards,
>
> Julien Naour
>


>>


Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
oh, and I think I installed jekyll using "gem install jekyll"

On Wed, Jan 6, 2016 at 4:17 PM, Michael Armbrust 
wrote:

> from docs/ run:
>
> SKIP_API=1 jekyll serve --watch
>
> On Wed, Jan 6, 2016 at 4:12 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> Hi Michael
>>
>> I am happy to add some documentation.
>>
>> I forked the repo but am having trouble with the markdown. The code
>> examples are not rendering correctly. I am on a mac and using
>> https://itunes.apple.com/us/app/marked-2/id890031187?mt=12
>>
>> I use a emacs or some other text editor to change the md.
>>
>> What tools do you use for editing viewing spark markdown files?
>>
>> Andy
>>
>>
>>
>> From: Michael Armbrust 
>> Date: Wednesday, January 6, 2016 at 11:09 AM
>> To: Andrew Davidson 
>> Cc: "user @spark" 
>> Subject: Re: problem with DataFrame df.withColumn()
>> org.apache.spark.sql.AnalysisException: resolved attribute(s) missing
>>
>> I really appreciate your help. I The following code works.
>>>
>>
>> Glad you got it to work!
>>
>> Is there a way this example can be added to the distribution to make it
>>> easier for future java programmers? It look me a long time get to this
>>> simple solution.
>>>
>>
>> I'd welcome a pull request that added UDFs to the programming guide
>> section on dataframes:
>>
>> http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
>>
>>
>


Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Michael Armbrust
from docs/ run:

SKIP_API=1 jekyll serve --watch

On Wed, Jan 6, 2016 at 4:12 PM, Andy Davidson  wrote:

> Hi Michael
>
> I am happy to add some documentation.
>
> I forked the repo but am having trouble with the markdown. The code
> examples are not rendering correctly. I am on a mac and using
> https://itunes.apple.com/us/app/marked-2/id890031187?mt=12
>
> I use a emacs or some other text editor to change the md.
>
> What tools do you use for editing viewing spark markdown files?
>
> Andy
>
>
>
> From: Michael Armbrust 
> Date: Wednesday, January 6, 2016 at 11:09 AM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: problem with DataFrame df.withColumn()
> org.apache.spark.sql.AnalysisException: resolved attribute(s) missing
>
> I really appreciate your help. I The following code works.
>>
>
> Glad you got it to work!
>
> Is there a way this example can be added to the distribution to make it
>> easier for future java programmers? It look me a long time get to this
>> simple solution.
>>
>
> I'd welcome a pull request that added UDFs to the programming guide
> section on dataframes:
>
> http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations
>
>


Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Turns out that I should have specified -i to my former grep command :-)

Thanks Marcelo

But does this mean that specifying custom value for parameter
spark.memory.offheap.size
would not take effect ?

Cheers

On Wed, Jan 6, 2016 at 2:47 PM, Marcelo Vanzin  wrote:

> Try "git grep -i spark.memory.offheap.size"...
>
> On Wed, Jan 6, 2016 at 2:45 PM, Ted Yu  wrote:
> > Maybe I looked in the wrong files - I searched *.scala and *.java files
> (in
> > latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config.
> >
> > Can someone enlighten me ?
> >
> > Thanks
> >
> > On Wed, Jan 6, 2016 at 2:35 PM, Jakob Odersky 
> wrote:
> >>
> >> Check the configuration guide for a description on units
> >> (
> http://spark.apache.org/docs/latest/configuration.html#spark-properties).
> >> In your case, 5GB would be specified as 5g.
> >>
> >> On 6 January 2016 at 10:29, unk1102  wrote:
> >>>
> >>> Hi As part of Spark 1.6 release what should be ideal value or unit for
> >>> spark.memory.offheap.size I have set as 5000 I assume it will be 5GB is
> >>> it
> >>> correct? Please guide.
> >>>
> >>>
> >>>
> >>> --
> >>> View this message in context:
> >>>
> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
> >>> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
> >>>
> >>> -
> >>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >>> For additional commands, e-mail: user-h...@spark.apache.org
> >>>
> >>
> >
>
>
>
> --
> Marcelo
>


Re: Timeout connecting between workers after upgrade to 1.6

2016-01-06 Thread Michael Armbrust
Logs from the workers?

On Wed, Jan 6, 2016 at 1:57 PM, Jeff Jones 
wrote:

> I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We
> are now seeing regular timeouts between two of the workers when making
> connections. These workers and the same driver code worked fine running on
> 1.4.1 and finished in under a second. Any thoughts on what might have
> changed?
>
> 16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning
> fetch of 1 outstanding blocks (after 3 retries)
> java.io.IOException: Connecting to /10.248.0.218:52104 timed out (12
> ms)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
> 16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3
> from BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
> java.io.IOException: Connecting to /10.248.0.218:52104 timed out (12
> ms)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
> at
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> at
> org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
> at
> org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
> at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> at java.util.concurrent.FutureTask.run(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> at java.lang.Thread.run(Unknown Source)
>
>
>
> Thanks,
> Jeff
>
>
>
>
>
> This message (and any attachments) is intended only for the designated
> recipient(s). It
> may contain confidential or proprietary information, or have other
> limitations on use as
> indicated by the sender. If you are not a designated recipient, you may
> not review, use,
> copy or distribute this message. If you received this in error, please
> notify the sender by
> reply e-mail and delete this message.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Timeout connecting between workers after upgrade to 1.6

2016-01-06 Thread Jeff Jones
I upgraded our Spark standalone cluster from 1.4.1 to 1.6.0 yesterday. We are 
now seeing regular timeouts between two of the workers when making connections. 
These workers and the same driver code worked fine running on 1.4.1 and 
finished in under a second. Any thoughts on what might have changed?

16/01/06 19:17:58 ERROR RetryingBlockFetcher: Exception while beginning fetch 
of 1 outstanding blocks (after 3 retries)
java.io.IOException: Connecting to /10.248.0.218:52104 timed out (12 ms)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
16/01/06 19:17:58 WARN BlockManager: Failed to fetch remote block rdd_74_3 from 
BlockManagerId(1, 10.248.0.218, 52104) (failed attempt 1)
java.io.IOException: Connecting to /10.248.0.218:52104 timed out (12 ms)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:214)
at 
org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:90)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
at java.util.concurrent.FutureTask.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)



Thanks,
Jeff





This message (and any attachments) is intended only for the designated 
recipient(s). It
may contain confidential or proprietary information, or have other limitations 
on use as
indicated by the sender. If you are not a designated recipient, you may not 
review, use,
copy or distribute this message. If you received this in error, please notify 
the sender by
reply e-mail and delete this message.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
These are my versions

cdh version = 5.4.1
spark version, 1.3.0
kafka = KAFKA-0.8.2.0-1.kafka1.3.1.p0.9
hbase versions = 1.0.0

Regards,
Nik.

On Wed, Jan 6, 2016 at 3:50 PM, Ted Yu  wrote:

> Which Spark / hadoop release are you using ?
>
> Thanks
>
> On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs 
> wrote:
>
>> Hello Team,
>>
>>
>> Thank you for your time in advance.
>>
>>
>> Below are the log lines of my spark job which is used for consuming the
>> messages from Kafka Instance and its loading to Hbase. I have noticed the
>> below Warn lines and later it resulted to errors. But I noticed that,
>> exactly after 7 days the token is getting expired and its trying to renew
>> the token but its not able to even after retrying it. Mine is a Kerberos
>> cluster. Can you please look into it and guide me whats the issue.
>>
>>
>> Your time and suggestions are very valuable.
>>
>>
>>
>> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141043 ms.0 from job set of time 145141043 ms
>>
>> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Starting job streaming job
>> 145141043 ms.1 from job set of time 145141043 ms
>>
>> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141043 ms.1 from job set of time 145141043 ms
>>
>> 15/12/29 11:33:50 INFO rdd.BlockRDD: Removing RDD 120956 from persistence
>> list
>>
>> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Total delay: 0.003 s for
>> time 145141043 ms (execution: 0.000 s)
>>
>> 15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120956
>>
>> 15/12/29 11:33:50 INFO kafka.KafkaInputDStream: Removing blocks of RDD
>> BlockRDD[120956] at createStream at SparkStreamingEngine.java:40 of time
>> 145141043 ms
>>
>> 15/12/29 11:33:50 INFO rdd.MapPartitionsRDD: Removing RDD 120957 from
>> persistence list
>>
>> 15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120957
>>
>> 15/12/29 11:33:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
>> ArrayBuffer(145141041 ms)
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Added jobs for time
>> 145141044 ms
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
>> 145141044 ms.0 from job set of time 145141044 ms
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141044 ms.0 from job set of time 145141044 ms
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
>> 145141044 ms.1 from job set of time 145141044 ms
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141044 ms.1 from job set of time 145141044 ms
>>
>> 15/12/29 11:34:00 INFO rdd.BlockRDD: Removing RDD 120958 from persistence
>> list
>>
>> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Total delay: 0.003 s for
>> time 145141044 ms (execution: 0.001 s)
>>
>> 15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120958
>>
>> 15/12/29 11:34:00 INFO kafka.KafkaInputDStream: Removing blocks of RDD
>> BlockRDD[120958] at createStream at SparkStreamingEngine.java:40 of time
>> 145141044 ms
>>
>> 15/12/29 11:34:00 INFO rdd.MapPartitionsRDD: Removing RDD 120959 from
>> persistence list
>>
>> 15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120959
>>
>> 15/12/29 11:34:00 INFO scheduler.ReceivedBlockTracker: Deleting batches
>> ArrayBuffer(145141042 ms)
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Added jobs for time
>> 145141045 ms
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
>> 145141045 ms.0 from job set of time 145141045 ms
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141045 ms.0 from job set of time 145141045 ms
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
>> 145141045 ms.1 from job set of time 145141045 ms
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
>> 145141045 ms.1 from job set of time 145141045 ms
>>
>> 15/12/29 11:34:10 INFO rdd.BlockRDD: Removing RDD 120960 from persistence
>> list
>>
>> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Total delay: 0.004 s for
>> time 145141045 ms (execution: 0.001 s)
>>
>> 15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120960
>>
>> 15/12/29 11:34:10 INFO kafka.KafkaInputDStream: Removing blocks of RDD
>> BlockRDD[120960] at createStream at SparkStreamingEngine.java:40 of time
>> 145141045 ms
>>
>> 15/12/29 11:34:10 INFO rdd.MapPartitionsRDD: Removing RDD 120961 from
>> persistence list
>>
>> 15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120961
>>
>> 15/12/29 11:34:10 INFO scheduler.ReceivedBlockTracker: Deleting batches
>> ArrayBuffer(145141043 ms)
>>
>> 15/12/29 11:34:13 WARN security.UserGroupInformation:
>> PriviledgedActionException as:s (auth:SIMPLE)
>> 

Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Marcelo Vanzin
Try "git grep -i spark.memory.offheap.size"...

On Wed, Jan 6, 2016 at 2:45 PM, Ted Yu  wrote:
> Maybe I looked in the wrong files - I searched *.scala and *.java files (in
> latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config.
>
> Can someone enlighten me ?
>
> Thanks
>
> On Wed, Jan 6, 2016 at 2:35 PM, Jakob Odersky  wrote:
>>
>> Check the configuration guide for a description on units
>> (http://spark.apache.org/docs/latest/configuration.html#spark-properties).
>> In your case, 5GB would be specified as 5g.
>>
>> On 6 January 2016 at 10:29, unk1102  wrote:
>>>
>>> Hi As part of Spark 1.6 release what should be ideal value or unit for
>>> spark.memory.offheap.size I have set as 5000 I assume it will be 5GB is
>>> it
>>> correct? Please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>



-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Date and Time as a Feature

2016-01-06 Thread Jorge Machado
Hello all, 

I'm new to machine learning. I'm trying to predict some electric usage.
The data is:
2015-12-10-10:00, 1200
2015-12-11-10:00, 1150

My question is: what is the best way to turn the date and time into features on my
Vector?

Something like this: Vector(1200, [2015,12,10,10,10])?
I could not find any example of value prediction where the features had dates in
them.
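
In MLlib terms I imagine that sketch would look roughly like the following (just
my own guess at an encoding, with an assumed year/month/day/hour/minute breakdown,
not a recommendation for which fields to use):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// label = usage, features = [year, month, day, hour, minute] (assumed breakdown)
val point = LabeledPoint(1200.0, Vectors.dense(2015.0, 12.0, 10.0, 10.0, 0.0))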

Thanks 

Jorge Machado 
jo...@jmachado.me


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem with DataFrame df.withColumn() org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

2016-01-06 Thread Andy Davidson
Hi Michael

I am happy to add some documentation.

I forked the repo but am having trouble with the markdown. The code examples
are not rendering correctly. I am on a mac and using
https://itunes.apple.com/us/app/marked-2/id890031187?mt=12

I use emacs or some other text editor to change the md.

What tools do you use for editing viewing spark markdown files?

Andy



From:  Michael Armbrust 
Date:  Wednesday, January 6, 2016 at 11:09 AM
To:  Andrew Davidson 
Cc:  "user @spark" 
Subject:  Re: problem with DataFrame df.withColumn()
org.apache.spark.sql.AnalysisException: resolved attribute(s) missing

>> I really appreciate your help. I The following code works.
> 
> Glad you got it to work!
> 
>> Is there a way this example can be added to the distribution to make it
>> easier for future java programmers? It look me a long time get to this simple
>> solution.
> 
> I'd welcome a pull request that added UDFs to the programming guide section on
> dataframes:
> http://spark.apache.org/docs/latest/sql-programming-guide.html#dataframe-operations




Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Ted Yu
Maybe I looked in the wrong files - I searched *.scala and *.java files (in
latest Spark 1.6.0 RC) for '.offheap.' but didn't find the config.

Can someone enlighten me ?

Thanks

On Wed, Jan 6, 2016 at 2:35 PM, Jakob Odersky  wrote:

> Check the configuration guide for a description on units (
> http://spark.apache.org/docs/latest/configuration.html#spark-properties).
> In your case, 5GB would be specified as 5g.
>
> On 6 January 2016 at 10:29, unk1102  wrote:
>
>> Hi As part of Spark 1.6 release what should be ideal value or unit for
>> spark.memory.offheap.size I have set as 5000 I assume it will be 5GB is it
>> correct? Please guide.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Tathagata Das
Could you show a sample of the file names? There are multiple things that
use UUIDs, so it would be good to see what the 100s of directories being
generated every second actually are.
If you are checkpointing every 400s then there shouldn't be checkpoint
directories written every second; they should be written in large bunches every
400s.

On Wed, Jan 6, 2016 at 3:13 PM, Jan Algermissen 
wrote:

> Hi,
>
> we are running a streaming job that processes about 500 events per 20s
> batches and uses updateStateByKey to accumulate Web sessions (with a 30
> Minute live time).
>
> The checkpoint intervall is set to 20xBatchInterval, that is 400s.
>
> Cluster size is 8 nodes.
>
> We are having trouble with the amount of files and directories created on
> the shared file system (GlusterFS) - there are about 100 new directories
> per second.
>
> Is that the expected magnitude of number of created directories? Or should
> we expect something different?
>
> What might we be doing wrong?  Can anyone share a pointer to material that
> explains the details of checkpointing?
>
> The checkpoint directories have UUIDs as names - ist that correct?
>
> Jan
>
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


connecting beeline to spark sql thrift server

2016-01-06 Thread Sunil Kumar


Hi,
I have an AWS Spark EMR cluster running with Spark 1.5.2, Hadoop 2.6 and Hive
1.0.0. I brought up the Spark SQL thriftserver on this cluster with
spark.sql.hive.metastore.version set to 1.0.

When I try to connect to this thriftserver remotely using the beeline packaged
with spark-1.5.2-hadoop2.6, it gives me this thrift error:

16/01/06 23:53:51 ERROR TThreadPoolServer: Thrift error occurred during processing of message.
org.apache.thrift.protocol.TProtocolException: Missing version in readMessageBegin, old client?
at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:228)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:27)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:285)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

AWS ships this thrift jar for Hive: libthrift-0.9.0.jar

How do I make sure that beeline uses the same thrift version? Is there a way to
see the thrift versions on the client/server side?
thanks


Re: Why is this job running since one hour?

2016-01-06 Thread Jakob Odersky
What is the job doing? How much data are you processing?

On 6 January 2016 at 10:33, unk1102  wrote:

> Hi I have one main Spark job which spawns multiple child spark jobs. One of
> the child spark job is running for an hour and it keeps on hanging there I
> have taken snap shot please see
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n25899/Screen_Shot_2016-01-06_at_11.jpg
> >
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-is-this-job-running-since-one-hour-tp25899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Jan Algermissen
Hi,

we are running a streaming job that processes about 500 events per 20s batches 
and uses updateStateByKey to accumulate Web sessions (with a 30 Minute live 
time).

The checkpoint interval is set to 20 x the batch interval, that is 400s.
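
Roughly, the relevant part of our setup looks like this (a simplified sketch with
placeholder names and a placeholder checkpoint path, not the exact code):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(20))        // 20s batches
ssc.checkpoint("/shared/glusterfs/checkpoints")          // placeholder path on the shared FS

val sessions = events.updateStateByKey(updateSession _)  // updateSession is a placeholder
sessions.checkpoint(Seconds(400))                        // 20 x the batch interval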

Cluster size is 8 nodes.

We are having trouble with the amount of files and directories created on the 
shared file system (GlusterFS) - there are about 100 new directories per second.

Is that the expected magnitude of number of created directories? Or should we 
expect something different?

What might we be doing wrong?  Can anyone share a pointer to material that 
explains the details of checkpointing?

The checkpoint directories have UUIDs as names - is that correct?

Jan





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: What should be the ideal value(unit) for spark.memory.offheap.size

2016-01-06 Thread Jakob Odersky
Check the configuration guide for a description on units (
http://spark.apache.org/docs/latest/configuration.html#spark-properties).
In your case, 5GB would be specified as 5g.
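
For example, a minimal sketch of setting it programmatically (note that the actual
key is camelCase, spark.memory.offHeap.size, which is why a case-sensitive grep for
the lowercase form misses it, and off-heap allocation also has to be enabled; please
double-check both names against your build):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.memory.offHeap.enabled", "true")  // off-heap use is opt-in
  .set("spark.memory.offHeap.size", "5g")       // a size string, e.g. 5g for 5 gigabytes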

On 6 January 2016 at 10:29, unk1102  wrote:

> Hi As part of Spark 1.6 release what should be ideal value or unit for
> spark.memory.offheap.size I have set as 5000 I assume it will be 5GB is it
> correct? Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/What-should-be-the-ideal-value-unit-for-spark-memory-offheap-size-tp25898.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Need Help in Spark Hive Data Processing

2016-01-06 Thread Balaraju.Kagidala Kagidala
Hi ,

  I am a new Spark user. I am trying to use Spark to process huge Hive
data using Spark DataFrames.


I have a 5-node Spark cluster, each node with 30 GB of memory. I want to process
a Hive table with 450 GB of data using DataFrames. Fetching a single row from the
Hive table takes 36 minutes. Please suggest what is wrong here; any help is
appreciated.


Thanks
Bala


Re: java.io.FileNotFoundException(Too many open files) in Spark streaming

2016-01-06 Thread Priya Ch
The line of code which I highlighted in the screenshot is within the Spark
source code. Spark uses a sort-based shuffle implementation, and the
spilled files are merged using merge sort.

Here is the link
https://issues.apache.org/jira/secure/attachment/12655884/Sort-basedshuffledesign.pdf
which would convey the same.

On Wed, Jan 6, 2016 at 8:19 PM, Annabel Melongo 
wrote:

> Priya,
>
> It would be helpful if you put the entire trace log along with your code
> to help determine the root cause of the error.
>
> Thanks
>
>
> On Wednesday, January 6, 2016 4:00 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Running 'lsof' will let us know the open files but how do we come to know
> the root cause behind opening too many files.
>
> Thanks,
> Padma CH
>
> On Wed, Jan 6, 2016 at 8:39 AM, Hamel Kothari 
> wrote:
>
> The "Too Many Files" part of the exception is just indicative of the fact
> that when that call was made, too many files were already open. It doesn't
> necessarily mean that that line is the source of all of the open files,
> that's just the point at which it hit its limit.
>
> What I would recommend is to try to run this code again and use "lsof" on
> one of the spark executors (perhaps run it in a for loop, writing the
> output to separate files) until it fails and see which files are being
> opened, if there's anything that seems to be taking up a clear majority
> that might key you in on the culprit.
>
> On Tue, Jan 5, 2016 at 9:48 PM Priya Ch 
> wrote:
>
> Yes, the fileinputstream is closed. May be i didn't show in the screen
> shot .
>
> As spark implements, sort-based shuffle, there is a parameter called
> maximum merge factor which decides the number of files that can be merged
> at once and this avoids too many open files. I am suspecting that it is
> something related to this.
>
> Can someone confirm on this ?
>
> On Tue, Jan 5, 2016 at 11:19 PM, Annabel Melongo <
> melongo_anna...@yahoo.com> wrote:
>
> Vijay,
>
> Are you closing the fileinputstream at the end of each loop ( in.close())?
> My guess is those streams aren't close and thus the "too many open files"
> exception.
>
>
> On Tuesday, January 5, 2016 8:03 AM, Priya Ch <
> learnings.chitt...@gmail.com> wrote:
>
>
> Can some one throw light on this ?
>
> Regards,
> Padma Ch
>
> On Mon, Dec 28, 2015 at 3:59 PM, Priya Ch 
> wrote:
>
> Chris, we are using spark 1.3.0 version. we have not set  
> spark.streaming.concurrentJobs
> this parameter. It takes the default value.
>
> Vijay,
>
>   From the tack trace it is evident that 
> org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$1.apply$mcVI$sp(ExternalSorter.scala:730)
> is throwing the exception. I opened the spark source code and visited the
> line which is throwing this exception i.e
>
> [image: Inline image 1]
>
> The lie which is marked in red is throwing the exception. The file is
> ExternalSorter.scala in org.apache.spark.util.collection package.
>
> i went through the following blog
> http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
> and understood that there is merge factor which decide the number of
> on-disk files that could be merged. Is it some way related to this ?
>
> Regards,
> Padma CH
>
> On Fri, Dec 25, 2015 at 7:51 PM, Chris Fregly  wrote:
>
> and which version of Spark/Spark Streaming are you using?
>
> are you explicitly setting the spark.streaming.concurrentJobs to
> something larger than the default of 1?
>
> if so, please try setting that back to 1 and see if the problem still
> exists.
>
> this is a dangerous parameter to modify from the default - which is why
> it's not well-documented.
>
>
> On Wed, Dec 23, 2015 at 8:23 AM, Vijay Gharge 
> wrote:
>
> Few indicators -
>
> 1) during execution time - check total number of open files using lsof
> command. Need root permissions. If it is cluster not sure much !
> 2) which exact line in the code is triggering this error ? Can you paste
> that snippet ?
>
>
> On Wednesday 23 December 2015, Priya Ch 
> wrote:
>
> ulimit -n 65000
>
> fs.file-max = 65000 ( in etc/sysctl.conf file)
>
> Thanks,
> Padma Ch
>
> On Tue, Dec 22, 2015 at 6:47 PM, Yash Sharma  wrote:
>
> Could you share the ulimit for your setup please ?
> - Thanks, via mobile,  excuse brevity.
> On Dec 22, 2015 6:39 PM, "Priya Ch"  wrote:
>
> Jakob,
>
>Increased the settings like fs.file-max in /etc/sysctl.conf and also
> increased user limit in /etc/security/limits.conf. But still see the same
> issue.
>
> On Fri, Dec 18, 2015 at 12:54 AM, Jakob Odersky 
> wrote:
>
> It might be a good idea to see how many files are open and try increasing
> the open file limit (this is done on an os level). In some application
> 

Re: Out of memory issue

2016-01-06 Thread Muthu Jayakumar
Thanks Ewan Leith. This seems like a good start, as it seems to match up to
the symptoms I am seeing :).

But how do I specify "parquet.memory.pool.ratio"?
The Parquet code seems to take this parameter in
ParquetOutputFormat.getRecordWriter()
(ref code: float maxLoad = conf.getFloat(ParquetOutputFormat.MEMORY_POOL_RATIO,
MemoryManager.DEFAULT_MEMORY_POOL_RATIO);).
I wonder how this is provided through Apache Spark. 'TaskAttemptContext' seems
to be the hint for providing it, but I am not able to find a way to pass this
configuration through.
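
Would something along these lines be the intended way? (An untested guess on my
part; I am assuming that spark.hadoop.* properties and sc.hadoopConfiguration
end up in the Hadoop Configuration that the Parquet writer reads.)

   // Untested sketch: spark.memory.fraction is a plain Spark conf, and the
   // Parquet setting is pushed into the Hadoop Configuration.
   val conf = new org.apache.spark.SparkConf()
     .set("spark.memory.fraction", "0.4")
     .set("spark.hadoop.parquet.memory.pool.ratio", "0.3")
   val sc = new org.apache.spark.SparkContext(conf)

   // Or, on an already-running context:
   sc.hadoopConfiguration.set("parquet.memory.pool.ratio", "0.3")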

Please advise,
Muthu

On Wed, Jan 6, 2016 at 1:57 AM, Ewan Leith 
wrote:

> Hi Muthu, this could be related to a known issue in the release notes
>
> http://spark.apache.org/releases/spark-release-1-6-0.html
>
> Known issues
>
> SPARK-12546 -  Save DataFrame/table as Parquet with dynamic partitions
> may cause OOM; this can be worked around by decreasing the memory used by
> both Spark and Parquet using spark.memory.fraction (for example, 0.4) and
> parquet.memory.pool.ratio (for example, 0.3, in Hadoop configuration, e.g.
> setting it in core-site.xml).
>
> It's definitely worth setting spark.memory.fraction and
> parquet.memory.pool.ratio and trying again.
>
> Ewan
>
> -Original Message-
> From: babloo80 [mailto:bablo...@gmail.com]
> Sent: 06 January 2016 03:44
> To: user@spark.apache.org
> Subject: Out of memory issue
>
> Hello there,
>
> I have a spark job reads 7 parquet files (8 GB, 3 x 16 GB, 3 x 14 GB) in
> different stages of execution and creates a result parquet of 9 GB (about
> 27 million rows containing 165 columns. some columns are map based
> containing utmost 200 value histograms). The stages involve, Step 1:
> Reading the data using dataframe api Step 2: Transform dataframe to RDD (as
> the some of the columns are transformed into histograms (using empirical
> distribution to cap the number of keys) and some of them run like UDAF
> during reduce-by-key step) to perform and perform some transformations Step
> 3: Reduce the result by key so that the resultant can be used in the next
> stage for join Step 4: Perform left outer join of this result which runs
> similar Steps 1 thru 3.
> Step 5: The results are further reduced to be written to parquet
>
> With Apache Spark 1.5.2, I am able to run the job with no issues.
> Current env uses 8 nodes running a total of  320 cores, 100 GB executor
> memory per node with driver program using 32 GB. The approximate execution
> time is about 1.2 hrs. The parquet files are stored in another HDFS cluster
> for read and eventual write of the result.
>
> When the same job is executed using Apache 1.6.0, some of the executor
> node's JVM gets restarted (with a new executor id). On further turning-on
> GC stats on the executor, the perm-gen seem to get maxed out and ends up
> showing the symptom of out-of-memory.
>
> Please advice on where to start investigating this issue.
>
> Thanks,
> Muthu
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Out-of-memory-issue-tp25888.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>


Re: Need Help in Spark Hive Data Processing

2016-01-06 Thread Jeff Zhang
It depends on how you fetch the single row. Is your query complex?

On Thu, Jan 7, 2016 at 12:47 PM, Balaraju.Kagidala Kagidala <
balaraju.kagid...@gmail.com> wrote:

> Hi ,
>
>   I am new user to spark. I am trying to use Spark to process huge Hive
> data using Spark DataFrames.
>
>
> I have 5 node Spark cluster each with 30 GB memory. i am want to process
> hive table with 450GB data using DataFrames. To fetch single row from Hive
> table its taking 36 mins. Pls suggest me what wrong here and any help is
> appreciated.
>
>
> Thanks
> Bala
>
>
>


-- 
Best Regards

Jeff Zhang


Re: org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Ted Yu
Have you seen this thread ?

http://search-hadoop.com/m/q3RTtAiQta22XrCI

On Wed, Jan 6, 2016 at 8:41 PM, Jia Zou  wrote:

> Dear all,
>
> I am using Spark1.5.2 and Tachyon0.7.1 to run KMeans with
> inputRDD.persist(StorageLevel.OFF_HEAP()).
>
> I've set tiered storage for Tachyon. It is all right when working set is
> smaller than available memory. However, when working set exceeds available
> memory, I keep getting errors like below:
>
> 16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.1 in stage
> 0.0 (TID 206) on executor 10.149.11.81: java.lang.RuntimeException
> (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found
>
> 16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.1 in stage
> 0.0 (TID 207) on executor 10.149.11.81: java.lang.RuntimeException
> (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found
>
> 16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.2 in stage
> 0.0 (TID 208) on executor 10.149.11.81: java.lang.RuntimeException
> (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found
>
> 16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.2 in stage
> 0.0 (TID 209) on executor 10.149.11.81: java.lang.RuntimeException
> (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found
>
> 16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.3 in stage
> 0.0 (TID 210) on executor 10.149.11.81: java.lang.RuntimeException
> (org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found
>
>
> Can any one give me some suggestions? Thanks a lot!
>
>
> Best Regards,
> Jia
>


spark dataframe read large mysql table running super slow

2016-01-06 Thread fightf...@163.com
Hi,
Recently I have been planning to use Spark SQL to run some tests over a large
MySQL table and to compare the performance between Spark and MyCat. However,
the load is super slow, and I hope someone can help tune this.

Environment: Spark 1.4.1
Code snippet:
   val prop = new java.util.Properties
   prop.setProperty("user", "root")
   prop.setProperty("password", "123456")

   val url1 = "jdbc:mysql://localhost:3306/db1"
   val jdbcDF = sqlContext.read.jdbc(url1, "video", prop)
   jdbcDF.registerTempTable("video_test")
   sqlContext.sql("select count(1) from video_test").show()

Overall, the load process gets stuck and ends with a connection timeout. The
MySQL table holds about 100 million records.
Would be happy to provide more usable info.
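
Would a partitioned read help here? Something like the following (untested
sketch, reusing prop and url1 from above; I am assuming video has a numeric
primary key id with values roughly in 1..100,000,000), so the table is loaded
over several parallel JDBC connections instead of a single one:

   // Partitioned read: Spark issues one query per partition with a WHERE
   // clause on the partition column, so the load runs in parallel.
   val jdbcDF = sqlContext.read.jdbc(
     url1,
     "video",
     "id",          // numeric column to partition on (assumed primary key)
     1L,            // lowerBound (assumed)
     100000000L,    // upperBound (assumed ~100M rows)
     32,            // numPartitions
     prop)

   jdbcDF.registerTempTable("video_test")
   sqlContext.sql("select count(1) from video_test").show()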

Best,
Sun.
   



fightf...@163.com


Re: LogisticsRegression in ML pipeline help page

2016-01-06 Thread Wen Pei Yu

You can find documentation for older releases under
http://spark.apache.org/documentation.html

The linear-methods docs for 1.5.2 are here:

http://spark.apache.org/docs/1.5.2/mllib-linear-methods.html#logistic-regression
http://spark.apache.org/docs/1.5.2/ml-linear-methods.html


Regards.
Yu Wenpei.


From:   Arunkumar Pillai 
To: user@spark.apache.org
Date:   01/07/2016 12:54 PM
Subject:LogisticsRegression in ML pipeline help page



Hi

I need the help page for Logistic Regression in the ML pipeline. When I browse,
I only get the 1.6 help; please help me.

--
Thanks and Regards
        Arun


org.apache.spark.storage.BlockNotFoundException in Spark1.5.2+Tachyon0.7.1

2016-01-06 Thread Jia Zou
Dear all,

I am using Spark1.5.2 and Tachyon0.7.1 to run KMeans with
inputRDD.persist(StorageLevel.OFF_HEAP()).

I've set tiered storage for Tachyon. It is all right when working set is
smaller than available memory. However, when working set exceeds available
memory, I keep getting errors like below:

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.1 in stage
0.0 (TID 206) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.1 in stage
0.0 (TID 207) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.2 in stage
0.0 (TID 208) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 191.2 in stage
0.0 (TID 209) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_191 not found

16/01/07 04:18:53 INFO scheduler.TaskSetManager: Lost task 197.3 in stage
0.0 (TID 210) on executor 10.149.11.81: java.lang.RuntimeException
(org.apache.spark.storage.BlockNotFoundException: Block rdd_1_197 not found


Can anyone give me some suggestions? Thanks a lot!


Best Regards,
Jia


LogisticsRegression in ML pipeline help page

2016-01-06 Thread Arunkumar Pillai
Hi

I need the help page for Logistic Regression in the ML pipeline. When I browse,
I only get the 1.6 help; please help me.

-- 
Thanks and Regards
Arun


Re: Update Hive tables from Spark without loading entire table in to a dataframe

2016-01-06 Thread Jörn Franke
You can mark the table as transactional and then you can do single updates. 
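
Roughly like this (a sketch only, table/column names invented here; the Hive
transaction manager also has to be enabled on the Hive side):

   // Sketch: the table must be bucketed, stored as ORC and flagged transactional.
   objHiveContext.sql(
     """CREATE TABLE myTable_acid (id INT, columnName STRING)
       |CLUSTERED BY (id) INTO 8 BUCKETS
       |STORED AS ORC
       |TBLPROPERTIES ('transactional'='true')""".stripMargin)

   // The single-row update is then issued from Hive itself (beeline / Hive CLI),
   // since Spark 1.4's HiveContext does not parse UPDATE statements:
   //   UPDATE myTable_acid SET columnName = 'test' WHERE id = 42;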

> On 07 Jan 2016, at 08:10, sudhir  wrote:
> 
> Hi,
> 
> I have a hive table of 20Lakh records and to update a row I have to load the
> entire table in dataframe and process that and then Save it (SaveAsTable in
> Overwrite mode)
> 
> When I tried to update using objHiveContext.sql("Update myTable set
> columnName='test' ") , I canno so this
> I'm using spark 1.4.1 and hive 1.2.1
> 
> How can I update a Hive table directly without overhead of loading it
> 
> Note: It is a ORC table.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Update-Hive-tables-from-Spark-without-loading-entire-table-in-to-a-dataframe-tp25902.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Need Help in Spark Hive Data Processing

2016-01-06 Thread Jörn Franke
You need the table in an efficient format, such as ORC or Parquet. Have the
table sorted appropriately (hint: sort on the most discriminating column in the
where clause). Do not use SAN or virtualization for the slave nodes.
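
For illustration only (table, column and partition names are invented here,
assuming a HiveContext named hiveContext; the exact DDL depends on your Hive
version): a partitioned, sorted ORC table, queried with a selective predicate so
that partition pruning and ORC predicate pushdown can kick in.

   // Sketch: store the data as a partitioned, sorted ORC table.
   hiveContext.sql(
     """CREATE TABLE sales_orc (customer_id BIGINT, amount DOUBLE)
       |PARTITIONED BY (event_date STRING)
       |CLUSTERED BY (customer_id) SORTED BY (customer_id) INTO 32 BUCKETS
       |STORED AS ORC""".stripMargin)

   // A selective query: the partition filter prunes directories, and the
   // sort/bucketing on customer_id helps ORC skip stripes.
   val df = hiveContext.sql(
     "SELECT * FROM sales_orc WHERE event_date = '2016-01-06' AND customer_id = 12345")
   df.show()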

Can you please post your query?

I always recommend avoiding single-row operations (lookups or updates) where
possible. They are very inefficient for analytics scenarios - this is somewhat
also true in the traditional database world (it depends on the use case, of
course).

> On 07 Jan 2016, at 05:47, Balaraju.Kagidala Kagidala 
>  wrote:
> 
> Hi ,
> 
>   I am new user to spark. I am trying to use Spark to process huge Hive data 
> using Spark DataFrames.
> 
> 
> I have 5 node Spark cluster each with 30 GB memory. i am want to process hive 
> table with 450GB data using DataFrames. To fetch single row from Hive table 
> its taking 36 mins. Pls suggest me what wrong here and any help is 
> appreciated.
> 
> 
> Thanks
> Bala
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem building spark on centos

2016-01-06 Thread Todd Nist
That should read "I think you're missing the --name option".  Sorry about
that.

On Wed, Jan 6, 2016 at 3:03 PM, Todd Nist  wrote:

> Hi Jade,
>
> I think you "--name" option. The makedistribution should look like this:
>
> ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests.
>
> As for why it failed to build with scala 2.11, did you run the
> ./dev/change-scala-version.sh 2.11 script to set the version of the
> artifacts to 2.11?  If you do that then issue the build like this I think
> you will be ok:
>
> ./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
> -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
> -DskipTests
>
> HTH.
>
> -Todd
>
> On Wed, Jan 6, 2016 at 2:20 PM, Jade Liu  wrote:
>
>> I’ve changed the scala version to 2.10.
>>
>> With this command:
>> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
>> package
>> Build was successful.
>>
>> But make a runnable version:
>> /make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
>>  -Phive -Phive-thriftserver –DskipTests
>> Still fails with the following error:
>> [ERROR] Failed to execute goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
>> on project spark-launcher_2.10: Execution scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
>> -> [Help 1]
>> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
>> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
>> (scala-compile-first) on project spark-launcher_2.10: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
>> at
>> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
>> at
>> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
>> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
>> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
>> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
>> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
>> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
>> at
>> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
>> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
>> scala-compile-first of goal
>> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
>> at
>> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
>> at
>> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
>> ... 20 more
>> Caused by: Compile failed via zinc server
>> at
>> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
>> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
>> at
>> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
>> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
>> at
>> scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
>> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
>> at
>> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>>
>> Not sure what’s causing it. Does anyone have any idea?
>>
>> Thanks!
>>
>> Jade
>> From: Ted Yu 
>> Date: Wednesday, January 6, 2016 at 10:40 AM
>> To: Jade Liu , user 
>>
>> Subject: Re: problem building spark on centos
>>
>> w.r.t. the second error, have you read this ?
>>
>> 

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Hi Jade,

I think you "--name" option. The makedistribution should look like this:

./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests.

As for why it failed to build with scala 2.11, did you run the
./dev/change-scala-version.sh 2.11 script to set the version of the
artifacts to 2.11?  If you do that then issue the build like this I think
you will be ok:

./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn
-Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11
-DskipTests

HTH.

-Todd

On Wed, Jan 6, 2016 at 2:20 PM, Jade Liu  wrote:

> I’ve changed the scala version to 2.10.
>
> With this command:
> build/mvn -X -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean
> package
> Build was successful.
>
> But make a runnable version:
> /make-distribution.sh --tgz -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0
>  -Phive -Phive-thriftserver –DskipTests
> Still fails with the following error:
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-launcher_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-launcher_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
>
> Not sure what’s causing it. Does anyone have any idea?
>
> Thanks!
>
> Jade
> From: Ted Yu 
> Date: Wednesday, January 6, 2016 at 10:40 AM
> To: Jade Liu , user 
>
> Subject: Re: problem building spark on centos
>
> w.r.t. the second error, have you read this ?
>
> http://www.captaindebug.com/2013/03/mavens-non-resolvable-parent-pom-problem.html#.Vo1fFGSrSuo
>
> On Wed, Jan 6, 2016 at 9:49 AM, Jade Liu  wrote:
>
>> I’m using 3.3.9. Thanks!
>>
>> Jade
>>
>> From: Ted Yu 
>> Date: Tuesday, January 5, 2016 at 4:57 PM
>> To: Jade Liu 
>> Cc: 

Spark Token Expired Exception

2016-01-06 Thread Nikhil Gs
Hello Team,


Thank you for your time in advance.


Below are the log lines of my Spark job, which consumes messages from a Kafka
instance and loads them into HBase. I have noticed the WARN lines below, which
later turned into errors. Exactly after 7 days the token expires, and the job
tries to renew it but is not able to, even after retrying. Mine is a Kerberos
cluster. Can you please look into it and tell me what the issue is?


Your time and suggestions are very valuable.



15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
145141043 ms.0 from job set of time 145141043 ms

15/12/29 11:33:50 INFO scheduler.JobScheduler: Starting job streaming job
145141043 ms.1 from job set of time 145141043 ms

15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
145141043 ms.1 from job set of time 145141043 ms

15/12/29 11:33:50 INFO rdd.BlockRDD: Removing RDD 120956 from persistence
list

15/12/29 11:33:50 INFO scheduler.JobScheduler: Total delay: 0.003 s for
time 145141043 ms (execution: 0.000 s)

15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120956

15/12/29 11:33:50 INFO kafka.KafkaInputDStream: Removing blocks of RDD
BlockRDD[120956] at createStream at SparkStreamingEngine.java:40 of time
145141043 ms

15/12/29 11:33:50 INFO rdd.MapPartitionsRDD: Removing RDD 120957 from
persistence list

15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120957

15/12/29 11:33:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer(145141041 ms)

15/12/29 11:34:00 INFO scheduler.JobScheduler: Added jobs for time
145141044 ms

15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
145141044 ms.0 from job set of time 145141044 ms

15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
145141044 ms.0 from job set of time 145141044 ms

15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
145141044 ms.1 from job set of time 145141044 ms

15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
145141044 ms.1 from job set of time 145141044 ms

15/12/29 11:34:00 INFO rdd.BlockRDD: Removing RDD 120958 from persistence
list

15/12/29 11:34:00 INFO scheduler.JobScheduler: Total delay: 0.003 s for
time 145141044 ms (execution: 0.001 s)

15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120958

15/12/29 11:34:00 INFO kafka.KafkaInputDStream: Removing blocks of RDD
BlockRDD[120958] at createStream at SparkStreamingEngine.java:40 of time
145141044 ms

15/12/29 11:34:00 INFO rdd.MapPartitionsRDD: Removing RDD 120959 from
persistence list

15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120959

15/12/29 11:34:00 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer(145141042 ms)

15/12/29 11:34:10 INFO scheduler.JobScheduler: Added jobs for time
145141045 ms

15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
145141045 ms.0 from job set of time 145141045 ms

15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
145141045 ms.0 from job set of time 145141045 ms

15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
145141045 ms.1 from job set of time 145141045 ms

15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
145141045 ms.1 from job set of time 145141045 ms

15/12/29 11:34:10 INFO rdd.BlockRDD: Removing RDD 120960 from persistence
list

15/12/29 11:34:10 INFO scheduler.JobScheduler: Total delay: 0.004 s for
time 145141045 ms (execution: 0.001 s)

15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120960

15/12/29 11:34:10 INFO kafka.KafkaInputDStream: Removing blocks of RDD
BlockRDD[120960] at createStream at SparkStreamingEngine.java:40 of time
145141045 ms

15/12/29 11:34:10 INFO rdd.MapPartitionsRDD: Removing RDD 120961 from
persistence list

15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120961

15/12/29 11:34:10 INFO scheduler.ReceivedBlockTracker: Deleting batches
ArrayBuffer(145141043 ms)

15/12/29 11:34:13 WARN security.UserGroupInformation:
PriviledgedActionException as:s (auth:SIMPLE)
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
token (HDFS_DELEGATION_TOKEN token 3104414 for s) is expired

15/12/29 11:34:13 *WARN ipc.Client: Exception encountered while connecting
to the server* :
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
token (HDFS_DELEGATION_TOKEN token 3104414 for s) is expired

15/12/29 11:34:13 *WARN security.UserGroupInformation:
PriviledgedActionException as:s (auth:SIMPLE) *
cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):*
token 

Re: error writing to stdout

2016-01-06 Thread Bryan Cutler
This is a known issue https://issues.apache.org/jira/browse/SPARK-9844.  As
Noorul said, it is probably safe to ignore as the executor process is
already destroyed at this point.

On Mon, Dec 21, 2015 at 8:54 PM, Noorul Islam K M  wrote:

> carlilek  writes:
>
> > My users use Spark 1.5.1 in standalone mode on an HPC cluster, with a
> > smattering still using 1.4.0
> >
> > I have been getting reports of errors like this:
> >
> > 15/12/21 15:40:33 ERROR FileAppender: Error writing stream to file
> > /scratch/spark/work/app-20151221150645-/3/stdout
> > java.io.IOException: Stream closed
> >   at
> java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:162)
> >   at java.io.BufferedInputStream.read1(BufferedInputStream.java:272)
> >   at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
> >   at java.io.FilterInputStream.read(FilterInputStream.java:107)
> >   at
> >
> org.apache.spark.util.logging.FileAppender.appendStreamToFile(FileAppender.scala:70)
> >   at
> >
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply$mcV$sp(FileAppender.scala:39)
> >   at
> >
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> >   at
> >
> org.apache.spark.util.logging.FileAppender$$anon$1$$anonfun$run$1.apply(FileAppender.scala:39)
> >   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
> >   at
> >
> org.apache.spark.util.logging.FileAppender$$anon$1.run(FileAppender.scala:38)
> > '
> >
> > So far I have been unable to reproduce reliably, but does anyone have any
> > ideas?
> >
>
> I have seen this happening in our cluster also. So far I have been
> ignoring this.
>
> Thanks and Regards
> Noorul
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: problem building spark on centos

2016-01-06 Thread Jade Liu
Hi, Todd:

Thanks for your suggestion. Yes I did run the ./dev/change-scala-version.sh 
2.11 script when using scala version 2.11.

I just tried this as you suggested:
./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver –DskipTests

Still got the same error:
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
Caused by: Compile failed via zinc server
at sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
... 21 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-launcher_2.10

Do you think it’s a Java problem? I’m using Oracle JDK 1.7. Should I update to
1.8 instead? I just checked the conf and it says 1.7.

Thanks,

Jade

From: Todd Nist >
Date: Wednesday, January 6, 2016 at 12:03 PM
To: Jade Liu >
Cc: Ted Yu >, 
"user@spark.apache.org" 
>
Subject: Re: problem building spark on centos

Hi Jade,

I think you "--name" option. The makedistribution should look like this:


./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests.

As for why it failed to build with scala 2.11, did you run the 
./dev/change-scala-version.sh 2.11 script to set the version of the artifacts 
to 2.11?  If you do that then issue the build like this I think you will be ok:


./make-distribution.sh --name hadoop-2.6_scala-2.11 --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -Dscala-2.11 

Re: problem building spark on centos

2016-01-06 Thread Todd Nist
Not sure, I just built it with java 8, but 7 is supported so that should be
fine.  Are you using maven 3.3.3 + ?

RADTech:spark-1.5.2 tnist$ mvn -version
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option
MaxPermSize=512m; support was removed in 8.0
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06;
2015-04-22T07:57:37-04:00)
Maven home: /usr/local/maven
Java version: 1.8.0_51, vendor: Oracle Corporation
Java home:
/Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"



On Wed, Jan 6, 2016 at 3:27 PM, Jade Liu  wrote:

> Hi, Todd:
>
> Thanks for your suggestion. Yes I did run the
> ./dev/change-scala-version.sh 2.11 script when using scala version 2.11.
>
> I just tried this as you suggested:
> ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver –DskipTests
>
> Still got the same error:
> [ERROR] Failed to execute goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first)
> on project spark-launcher_2.10: Execution scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed
> -> [Help 1]
> org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute
> goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile
> (scala-compile-first) on project spark-launcher_2.10: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
> at
> org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
> at
> org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
> at
> org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
> at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
> at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
> at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
> at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
> at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
> at
> org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
> Caused by: org.apache.maven.plugin.PluginExecutionException: Execution
> scala-compile-first of goal
> net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
> at
> org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
> ... 20 more
> Caused by: Compile failed via zinc server
> at
> sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
> at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
> at
> scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
> at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
> at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
> at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
> at
> org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
> ... 21 more
> [ERROR]
> [ERROR]
> [ERROR] For more information about the errors and possible solutions,
> please read the following articles:
> [ERROR] [Help 1]
> http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
> [ERROR]
> [ERROR] After correcting the problems, you can resume the build with the
> command
> [ERROR]   mvn  -rf :spark-launcher_2.10
>
> Do you think it’s java problem? I’m using oracle JDK 1.7. Should I update
> it to 1.8 instead? I just checked the conf and it says 1.7.
>
> Thanks,
>
> Jade
>
> From: Todd Nist 
> Date: Wednesday, January 6, 2016 at 12:03 PM
> To: Jade Liu 
> Cc: Ted 

Re: Spark Token Expired Exception

2016-01-06 Thread Ted Yu
Which Spark / hadoop release are you using ?

Thanks

On Wed, Jan 6, 2016 at 12:16 PM, Nikhil Gs 
wrote:

> Hello Team,
>
>
> Thank you for your time in advance.
>
>
> Below are the log lines of my spark job which is used for consuming the
> messages from Kafka Instance and its loading to Hbase. I have noticed the
> below Warn lines and later it resulted to errors. But I noticed that,
> exactly after 7 days the token is getting expired and its trying to renew
> the token but its not able to even after retrying it. Mine is a Kerberos
> cluster. Can you please look into it and guide me whats the issue.
>
>
> Your time and suggestions are very valuable.
>
>
>
> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
> 145141043 ms.0 from job set of time 145141043 ms
>
> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Starting job streaming job
> 145141043 ms.1 from job set of time 145141043 ms
>
> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Finished job streaming job
> 145141043 ms.1 from job set of time 145141043 ms
>
> 15/12/29 11:33:50 INFO rdd.BlockRDD: Removing RDD 120956 from persistence
> list
>
> 15/12/29 11:33:50 INFO scheduler.JobScheduler: Total delay: 0.003 s for
> time 145141043 ms (execution: 0.000 s)
>
> 15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120956
>
> 15/12/29 11:33:50 INFO kafka.KafkaInputDStream: Removing blocks of RDD
> BlockRDD[120956] at createStream at SparkStreamingEngine.java:40 of time
> 145141043 ms
>
> 15/12/29 11:33:50 INFO rdd.MapPartitionsRDD: Removing RDD 120957 from
> persistence list
>
> 15/12/29 11:33:50 INFO storage.BlockManager: Removing RDD 120957
>
> 15/12/29 11:33:50 INFO scheduler.ReceivedBlockTracker: Deleting batches
> ArrayBuffer(145141041 ms)
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Added jobs for time
> 145141044 ms
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
> 145141044 ms.0 from job set of time 145141044 ms
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
> 145141044 ms.0 from job set of time 145141044 ms
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Starting job streaming job
> 145141044 ms.1 from job set of time 145141044 ms
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Finished job streaming job
> 145141044 ms.1 from job set of time 145141044 ms
>
> 15/12/29 11:34:00 INFO rdd.BlockRDD: Removing RDD 120958 from persistence
> list
>
> 15/12/29 11:34:00 INFO scheduler.JobScheduler: Total delay: 0.003 s for
> time 145141044 ms (execution: 0.001 s)
>
> 15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120958
>
> 15/12/29 11:34:00 INFO kafka.KafkaInputDStream: Removing blocks of RDD
> BlockRDD[120958] at createStream at SparkStreamingEngine.java:40 of time
> 145141044 ms
>
> 15/12/29 11:34:00 INFO rdd.MapPartitionsRDD: Removing RDD 120959 from
> persistence list
>
> 15/12/29 11:34:00 INFO storage.BlockManager: Removing RDD 120959
>
> 15/12/29 11:34:00 INFO scheduler.ReceivedBlockTracker: Deleting batches
> ArrayBuffer(145141042 ms)
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Added jobs for time
> 145141045 ms
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
> 145141045 ms.0 from job set of time 145141045 ms
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
> 145141045 ms.0 from job set of time 145141045 ms
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Starting job streaming job
> 145141045 ms.1 from job set of time 145141045 ms
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Finished job streaming job
> 145141045 ms.1 from job set of time 145141045 ms
>
> 15/12/29 11:34:10 INFO rdd.BlockRDD: Removing RDD 120960 from persistence
> list
>
> 15/12/29 11:34:10 INFO scheduler.JobScheduler: Total delay: 0.004 s for
> time 145141045 ms (execution: 0.001 s)
>
> 15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120960
>
> 15/12/29 11:34:10 INFO kafka.KafkaInputDStream: Removing blocks of RDD
> BlockRDD[120960] at createStream at SparkStreamingEngine.java:40 of time
> 145141045 ms
>
> 15/12/29 11:34:10 INFO rdd.MapPartitionsRDD: Removing RDD 120961 from
> persistence list
>
> 15/12/29 11:34:10 INFO storage.BlockManager: Removing RDD 120961
>
> 15/12/29 11:34:10 INFO scheduler.ReceivedBlockTracker: Deleting batches
> ArrayBuffer(145141043 ms)
>
> 15/12/29 11:34:13 WARN security.UserGroupInformation:
> PriviledgedActionException as:s (auth:SIMPLE)
> cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
> token (HDFS_DELEGATION_TOKEN token 3104414 for s) is expired
>
> 15/12/29 11:34:13 *WARN ipc.Client: Exception encountered while
> connecting to the server* :
> 

Re: Unable to run spark SQL Join query.

2016-01-06 Thread ๏̯͡๏
Any suggestions on how to do joins in Spark SQL? The Spark SQL syntax above is
not working.
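
For reference, this is roughly the join I am trying to express, in DataFrame
API form (simplified and untested; only the join columns are shown, taken from
the SQL in the quoted message below):

   val events = hiveContext.sql("SELECT * FROM success_events.sojsuccessevents1").as("a")
   val bids   = hiveContext.sql("SELECT * FROM dw_bid").as("b")

   // Equality join on item id and transaction id.
   val joined = events.join(
     bids,
     events("itemid") === bids("item_id") &&
       events("transactionid") === bids("transaction_id"))

   joined.registerTempTable("sojsuccessevents2_spark_tmp")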

On Mon, Jan 4, 2016 at 2:33 AM, ÐΞ€ρ@Ҝ (๏̯͡๏)  wrote:

> There are three tables in action here.
>
> Table A (success_events.sojsuccessevents1) JOIN TABLE B (dw_bid) to
> create TABLE C (sojsuccessevents2_spark)
>
> Now table success_events.sojsuccessevents1 has itemid that i confirmed by
> running describe success_events.sojsuccessevents1 from spark-sql shell.
>
> I changed my join query to use itemid.
>
> " on a.itemid = b.item_id  and  a.transactionid =  b.transaction_id " +
>
> But still i get the same error
>
>
> 16/01/04 03:29:27 INFO yarn.ApplicationMaster: Final app status: FAILED,
> exitCode: 15, (reason: User class threw exception:
> org.apache.spark.sql.AnalysisException: cannot resolve 'a.itemid' given
> input columns bid_flags, slng_chnl_id, upd_user, bdr_id, bid_status_id,
> bid_dt, transaction_id, host_ip_addr, upd_date, item_vrtn_id, auct_end_dt,
> bid_amt_unit_lstg_curncy, bidding_site_id, cre_user, bid_cobrand_id,
> bdr_site_id, app_id, lstg_curncy_id, bid_exchng_rate, bid_date,
> ebx_bid_yn_id, cre_date, winning_qty, bid_type_code, half_on_ebay_bid_id,
> bdr_cntry_id, qty_bid, item_id; line 1 pos 864)
>
> It appears as if its trying to look for itemid in TABLE B (dw_bid) instead
> of TABLE A (success_events.sojsuccessevents1) As above columns are from
> TABLE B.
>
> Regards,
> Deepak
>
>
>
>
> On Sun, Jan 3, 2016 at 7:42 PM, Jins George  wrote:
>
>> Column 'itemId' is not present in table 'success_events.sojsuccessevents1'
>> or  'dw_bid'
>>
>> did you mean  'sojsuccessevents2_spark' table  in your select query ?
>>
>> Thanks,
>> Jins
>>
>> On 01/03/2016 07:22 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
>>
>> Code:
>>
>> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
>>
>> hiveContext.sql("drop table sojsuccessevents2_spark")
>>
>> hiveContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid`
>> string COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from
>> deserializer', `sessionstartdate` string COMMENT 'from deserializer',
>> `sojdatadate` string COMMENT 'from deserializer', `seqnum` int COMMENT
>> 'from deserializer', `eventtimestamp` string COMMENT 'from deserializer',
>> `siteid` int COMMENT 'from deserializer', `successeventtype` string COMMENT
>> 'from deserializer', `sourcetype` string COMMENT 'from deserializer',
>> `itemid` bigint COMMENT 'from deserializer', `shopcartid` bigint COMMENT
>> 'from deserializer', `transactionid` bigint COMMENT 'from deserializer',
>> `offerid` bigint COMMENT 'from deserializer', `userid` bigint COMMENT 'from
>> deserializer', `priorpage1seqnum` int COMMENT 'from deserializer',
>> `priorpage1pageid` int COMMENT 'from deserializer',
>> `exclwmsearchattemptseqnum` int COMMENT 'from deserializer',
>> `exclpriorsearchpageid` int COMMENT 'from deserializer',
>> `exclpriorsearchseqnum` int COMMENT 'from deserializer',
>> `exclpriorsearchcategory` int COMMENT 'from deserializer',
>> `exclpriorsearchl1` int COMMENT 'from deserializer', `exclpriorsearchl2`
>> int COMMENT 'from deserializer', `currentimpressionid` bigint COMMENT 'from
>> deserializer', `sourceimpressionid` bigint COMMENT 'from deserializer',
>> `exclpriorsearchsqr` string COMMENT 'from deserializer',
>> `exclpriorsearchsort` string COMMENT 'from deserializer', `isduplicate` int
>> COMMENT 'from deserializer', `transactiondate` string COMMENT 'from
>> deserializer', `auctiontypecode` int COMMENT 'from deserializer', `isbin`
>> int COMMENT 'from deserializer', `leafcategoryid` int COMMENT 'from
>> deserializer', `itemsiteid` int COMMENT 'from deserializer', `bidquantity`
>> int COMMENT 'from deserializer', `bidamtusd` double COMMENT 'from
>> deserializer', `offerquantity` int COMMENT 'from deserializer',
>> `offeramountusd` double COMMENT 'from deserializer', `offercreatedate`
>> string COMMENT 'from deserializer', `buyersegment` string COMMENT 'from
>> deserializer', `buyercountryid` int COMMENT 'from deserializer', `sellerid`
>> bigint COMMENT 'from deserializer', `sellercountryid` int COMMENT 'from
>> deserializer', `sellerstdlevel` string COMMENT 'from deserializer',
>> `csssellerlevel` string COMMENT 'from deserializer', `experimentchannel`
>> int COMMENT 'from deserializer') ROW FORMAT SERDE
>> 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT
>> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT
>> 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION
>> 'hdfs://
>> apollo-phx-nn.vip.ebay.com:8020/user/dvasthimal/spark/successeventstaging/sojsuccessevents2'
>> TBLPROPERTIES ( 'avro.schema.literal'='{\"type\":\"record\",\"name\":\"
>> success\",\"namespace\":\"Reporting.detail\",\"doc\":\"\",\"fields\":[{\"
>> name\":\"guid\",\"type\":{\"type\":\"string\",\"avro.java.string\":\"
>> String\"},\"doc\":\"\",\"default\":\"\"},{\"name\":\"sessionKey\",\"type
>> 

Re: problem building spark on centos

2016-01-06 Thread Marcelo Vanzin
If you're trying to compile against Scala 2.11, you're missing
"-Dscala-2.11" in that command.

On Wed, Jan 6, 2016 at 12:27 PM, Jade Liu  wrote:
> Hi, Todd:
>
> Thanks for your suggestion. Yes I did run the ./dev/change-scala-version.sh
> 2.11 script when using scala version 2.11.
>
> I just tried this as you suggested:
> ./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6
> -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver –DskipTests

-- 
Marcelo

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: problem building spark on centos

2016-01-06 Thread Jade Liu
Yes I’m using maven 3.3.9.


From: Todd Nist >
Date: Wednesday, January 6, 2016 at 12:33 PM
To: Jade Liu >
Cc: "user@spark.apache.org" 
>
Subject: Re: problem building spark on centos

Not sure, I just built it with java 8, but 7 is supported so that should be 
fine.  Are you using maven 3.3.3 + ?

RADTech:spark-1.5.2 tnist$ mvn -version
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
support was removed in 8.0
Apache Maven 3.3.3 (7994120775791599e205a5524ec3e0dfe41d4a06; 
2015-04-22T07:57:37-04:00)
Maven home: /usr/local/maven
Java version: 1.8.0_51, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_51.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.10.5", arch: "x86_64", family: "mac"



On Wed, Jan 6, 2016 at 3:27 PM, Jade Liu 
> wrote:
Hi, Todd:

Thanks for your suggestion. Yes I did run the ./dev/change-scala-version.sh 
2.11 script when using scala version 2.11.

I just tried this as you suggested:
./make-distribution.sh --name hadoop-2.6 --tgz -Pyarn -Phadoop-2.6 
-Dhadoop.version=2.6.0 -Phive -Phive-thriftserver –DskipTests

Still got the same error:
[ERROR] Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed. CompileFailed -> 
[Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile (scala-compile-first) on 
project spark-launcher_2.10: Execution scala-compile-first of goal 
net.alchim31.maven:scala-maven-plugin:3.2.2:compile failed.
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at 
org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at 
org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:863)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:288)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:199)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at 
org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.PluginExecutionException: Execution 
scala-compile-first of goal net.alchim31.maven:scala-maven-plugin:3.2.2:compile 
failed.
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:145)
at 
org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:207)
... 20 more
Caused by: Compile failed via zinc server
at sbt_inc.SbtIncrementalCompiler.zincCompile(SbtIncrementalCompiler.java:136)
at sbt_inc.SbtIncrementalCompiler.compile(SbtIncrementalCompiler.java:86)
at 
scala_maven.ScalaCompilerSupport.incrementalCompile(ScalaCompilerSupport.java:303)
at scala_maven.ScalaCompilerSupport.compile(ScalaCompilerSupport.java:119)
at scala_maven.ScalaCompilerSupport.doExecute(ScalaCompilerSupport.java:99)
at scala_maven.ScalaMojoSupport.execute(ScalaMojoSupport.java:482)
at 
org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
... 21 more
[ERROR]
[ERROR]
[ERROR] For more information about the errors and possible solutions, please 
read the following articles:
[ERROR] [Help 1] 
http://cwiki.apache.org/confluence/display/MAVEN/PluginExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn  -rf :spark-launcher_2.10

Do you think it’s java problem? I’m using 

Re: pyspark dataframe: row with a minimum value of a column for each group

2016-01-06 Thread Kristina Rogale Plazonic
Try redefining your window without the orderBy part. In other words, rerun your
code with

window = Window.partitionBy("a")

The thing is that the window frame is defined differently in the two cases:
with an orderBy, the default frame runs from the start of the partition up to
the current row, so the aggregate is computed over a growing ("running")
window. In your example, in the group where "a" is 1:

  - If you include the orderBy option, it is a running window:
   - the 1st min is computed on the first row in this group,
   - the 2nd min is computed on the first 2 rows in this group,
   - the 3rd min is computed on the first 3 rows in this group.

  - If you don't include the orderBy option, the aggregate is computed on a
constant window of width 3 (the whole group). With min over a column sorted in
ascending order the two happen to agree, which is why only max looked wrong.

On Wed, Jan 6, 2016 at 2:34 PM, Wei Chen  wrote:

> Thank you. I have tried the window function as follows:
>
> import pyspark.sql.functions as f
> sqc = sqlContext
> from pyspark.sql import Window
> import pandas as pd
>
> DF = pd.DataFrame({'a': [1,1,1,2,2,2,3,3,3],
>'b': [1,2,3,1,2,3,1,2,3],
>'c': [1,2,3,4,5,6,7,8,9]
>   })
>
> df = sqc.createDataFrame(DF)
>
> window = Window.partitionBy("a").orderBy("c")
>
> df.select('a', 'b', 'c', f.min('c').over(window).alias('y')).show()
>
> I got the following result which is understandable:
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  1|
> |  1|  3|  3|  1|
> |  2|  1|  4|  4|
> |  2|  2|  5|  4|
> |  2|  3|  6|  4|
> |  3|  1|  7|  7|
> |  3|  2|  8|  7|
> |  3|  3|  9|  7|
> +---+---+---+---+
>
>
> However if I change min to max, the result is not what is expected:
>
> df.select('a', 'b', 'c', f.max('c').over(window).alias('y')).show() gives
>
> +---+---+---+---+
> |  a|  b|  c|  y|
> +---+---+---+---+
> |  1|  1|  1|  1|
> |  1|  2|  2|  2|
> |  1|  3|  3|  3|
> |  2|  1|  4|  4|
> |  2|  2|  5|  5|
> |  2|  3|  6|  6|
> |  3|  1|  7|  7|
> |  3|  2|  8|  8|
> |  3|  3|  9|  9|
> +---+---+---+---+
>
>
>
> Thanks,
>
> Wei
>
>
> On Tue, Jan 5, 2016 at 8:30 PM, ayan guha  wrote:
>
>> Yes there is. It is called window function over partitions.
>>
>> Equivalent SQL would be:
>>
>> select * from
>>  (select a,b,c, rank() over (partition by a order by b) r from
>> df) x
>> where r = 1
>>
>> You can register your DF as a temp table and use the sql form. Or,
>> (>Spark 1.4) you can use window methods and their variants in Spark SQL
>> module.
>>
>> HTH
>>
>> On Wed, Jan 6, 2016 at 11:56 AM, Wei Chen 
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to retrieve the rows with a minimum value of a column for
>>> each group. For example: the following dataframe:
>>>
>>> a | b | c
>>> --
>>> 1 | 1 | 1
>>> 1 | 2 | 2
>>> 1 | 3 | 3
>>> 2 | 1 | 4
>>> 2 | 2 | 5
>>> 2 | 3 | 6
>>> 3 | 1 | 7
>>> 3 | 2 | 8
>>> 3 | 3 | 9
>>> --
>>>
>>> I group by 'a', and want the rows with the smallest 'b', that is, I want
>>> to return the following dataframe:
>>>
>>> a | b | c
>>> --
>>> 1 | 1 | 1
>>> 2 | 1 | 4
>>> 3 | 1 | 7
>>> --
>>>
>>> The dataframe I have is huge so get the minimum value of b from each
>>> group and joining on the original dataframe is very expensive. Is there a
>>> better way to do this?
>>>
>>>
>>> Thanks,
>>> Wei
>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Wei Chen, Ph.D.
> Astronomer and Data Scientist
> Phone: (832)646-7124
> Email: wei.chen.ri...@gmail.com
> LinkedIn: https://www.linkedin.com/in/weichen1984
>