what's the way to access the last element from another partition

2015-12-08 Thread Zhiliang Zhu
From within a given partition, it seems difficult to access the last element of 
another partition, but in my application I need to do exactly that. How can it be 
done?
Just by repartitioning / shuffling the RDD into one partition and getting the 
specific "last" element? Would that change the previous order among the elements, 
and would it therefore not work?
Thanks very much in advance!  

On Monday, December 7, 2015 11:32 AM, Zhiliang Zhu  
wrote:
 

  


On Monday, December 7, 2015 10:37 AM, DB Tsai  wrote:
 

Only the beginning and ending parts of the data need to be shuffled. The rest
within each partition can be compared without a shuffle.


Would you help by writing a little pseudo-code for it? It seems there is no 
shuffle-related API for this other than repartition.
Thanks a lot in advance!
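
For reference, a minimal Scala sketch (not from the thread) of the approach DB Tsai 
describes: collect only the first element of each partition to the driver, broadcast 
that small map, and compare adjacent elements locally, including across partition 
boundaries. It assumes an rdd: RDD[Int] whose partition order matches the logical 
order, and a user-supplied compareFun: (Int, Int) => Int.

// Sketch only: rdd and compareFun are assumed to exist as described above.
val firstElems: Map[Int, Int] = rdd
  .mapPartitionsWithIndex { (idx, it) =>
    if (it.hasNext) Iterator((idx, it.next())) else Iterator.empty
  }
  .collect()
  .toMap

val firstElemsBc = rdd.sparkContext.broadcast(firstElems)

val comparisons = rdd.mapPartitionsWithIndex { (idx, it) =>
  val elems = it.toVector
  // compare adjacent elements inside this partition
  val local = elems.sliding(2).collect { case Seq(a, b) => compareFun(a, b) }.toVector
  // compare the last element of this partition with the first element of the next one
  val boundary = for {
    last <- elems.lastOption
    next <- firstElemsBc.value.get(idx + 1)
  } yield compareFun(last, next)
  (local ++ boundary).iterator
}

Only one element per partition crosses the network, so the previous order and the 
parallelism of the original RDD are preserved.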






Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D


On Sun, Dec 6, 2015 at 6:27 PM, Zhiliang Zhu  wrote:
>
>
>
>
> On Saturday, December 5, 2015 3:00 PM, DB Tsai  wrote:
>
>
> This is tricky. You need to shuffle the ending and beginning elements
> using mapPartitionsWithIndex.
>
>
> Does this mean that I need to shuffle all the elements in the different
> partitions into one partition, then compare them by taking any two adjacent
> elements?
> It seems good, if it is like that.
>
> One more issue: will it lose parallelism, since there would then be only one
> partition ...
>
> Thanks very much in advance!
>
>
>
>
>
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D
>
>
> On Fri, Dec 4, 2015 at 10:30 PM, Zhiliang Zhu  wrote:
>> Hi All,
>>
>> I would like to compare any two adjacent elements in one given rdd, just
>> as
>> the single machine code part:
>>
>> int a[N] = {...};
>> for (int i=0; i < N - 1; ++i) {
>>    compareFun(a[i], a[i+1]);
>> }
>> ...
>>
>> mapPartitions may work for some situations; however, it cannot compare
>> elements in different partitions.
>> foreach also does not seem to work.
>>
>> Thanks,
>> Zhiliang
>
>>
>>
>




   

  

Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu

what’s your data format? ORC or CSV or others?

import sqlContext.implicits._   // for the $"..." column syntax

val keys = sqlContext.read.orc("your previous batch data path")
  .select($"uniq_key")
  .map(_.getString(0))          // assumes uniq_key is a string column
  .collect()
val keysBroadcast = sc.broadcast(keys.toSet)

val rdd = your_current_batch_data
rdd.filter(line => !keysBroadcast.value.contains(line.key))






> On Dec 8, 2015, at 4:44 PM, Ramkumar V  wrote:
> 
> Im running spark batch job in cluster mode every hour and it runs for 15 
> minutes. I have certain unique keys in the dataset. i dont want to process 
> those keys during my next hour batch.
> 
> Thanks,
> 
>   
> 
> 
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu  > wrote:
> Can you detail your question?  what looks like your previous batch and the 
> current batch?
> 
> 
> 
> 
> 
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V > > wrote:
>> 
>> Hi,
>> 
>> I'm running java over spark in cluster mode. I want to apply filter on 
>> javaRDD based on some previous batch values. if i store those values in 
>> mapDB, is it possible to apply filter during the current batch ?
>> 
>> Thanks,
>> 
>>   
>> 
> 
> 



Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
I'm running a Spark batch job in cluster mode every hour, and it runs for 15
minutes. I have certain unique keys in the dataset. I don't want to process
those keys during my next hourly batch.

*Thanks*,



On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
wrote:

> Can you detail your question?  what looks like your previous batch and the
> current batch?
>
>
>
>
>
> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>
> Hi,
>
> I'm running java over spark in cluster mode. I want to apply filter on
> javaRDD based on some previous batch values. if i store those values in
> mapDB, is it possible to apply filter during the current batch ?
>
> *Thanks*,
> 
>
>
>


Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Pipe-separated values. I know that broadcast and join work, but I would like to
know whether MapDB works or not.

*Thanks*,



On Tue, Dec 8, 2015 at 2:22 PM, Fengdong Yu 
wrote:

>
> what’s your data format? ORC or CSV or others?
>
> val keys = sqlContext.read.orc(“your previous batch data
> path”).select($”uniq_key”).collect
> val broadCast = sc.broadCast(keys)
>
> val rdd = your_current_batch_data
> rdd.filter( line => line.key  not in broadCase.value)
>
>
>
>
>
>
> On Dec 8, 2015, at 4:44 PM, Ramkumar V  wrote:
>
> Im running spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset. i dont want to process
> those keys during my next hour batch.
>
> *Thanks*,
> 
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
> wrote:
>
>> Can you detail your question?  what looks like your previous batch and
>> the current batch?
>>
>>
>>
>>
>>
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>
>> Hi,
>>
>> I'm running java over spark in cluster mode. I want to apply filter on
>> javaRDD based on some previous batch values. if i store those values in
>> mapDB, is it possible to apply filter during the current batch ?
>>
>> *Thanks*,
>> 
>>
>>
>>
>
>


Re: parquet file doubts

2015-12-08 Thread Cheng Lian
Cc'd Parquet dev list. At first I expected to discuss this issue on 
Parquet dev list but sent to the wrong mailing list. However, I think 
it's OK to discuss it here since lots of Spark users are using Parquet 
and this information should be generally useful here.


Comments inlined.

On 12/7/15 10:34 PM, Shushant Arora wrote:

> how to read it using parquet tools.
> When I did
> hadoop parquet.tools.Main meta parquetfilename
> I didn't get any info of min and max values.

I didn't realize that you meant to inspect min/max values, since what you 
asked was how to inspect the version of the Parquet library that was used to 
generate the Parquet file.

Currently parquet-tools doesn't print min/max statistics information. 
I'm afraid you'll have to do it programmatically.
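
For what it's worth, a rough Scala sketch (not from the thread) of doing it 
programmatically with parquet-hadoop; note that the package prefix changed between 
Parquet releases (parquet.* in older versions, org.apache.parquet.* from 1.8 on), 
so adjust the imports to whatever is on your classpath, and the file path below is 
a placeholder.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.ParquetFileReader   // org.apache.parquet.hadoop in newer releases
import scala.collection.JavaConverters._

// Read the footer and print per-column min/max for every row group.
val footer = ParquetFileReader.readFooter(new Configuration(), new Path("/path/to/file.parquet"))
for (block <- footer.getBlocks.asScala; col <- block.getColumns.asScala) {
  val stats = col.getStatistics
  if (stats != null)
    println(s"${col.getPath}: min=${stats.minAsString}, max=${stats.maxAsString}")
}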
> How can I see the Parquet version of my file? Is min/max specific to some
> Parquet version, or has it been available since the beginning?

AFAIK, it was added in 1.5.0:
https://github.com/apache/parquet-mr/blob/parquet-1.5.0/parquet-column/src/main/java/parquet/column/statistics/Statistics.java

But I failed to find the corresponding JIRA ticket or pull request for this.



On Mon, Dec 7, 2015 at 6:51 PM, Singh, Abhijeet 
> wrote:


Yes, Parquet has min/max.

*From:*Cheng Lian [mailto:l...@databricks.com
]
*Sent:* Monday, December 07, 2015 11:21 AM
*To:* Ted Yu
*Cc:* Shushant Arora; user@spark.apache.org

*Subject:* Re: parquet file doubts

Oh sorry... At first I meant to cc spark-user list since Shushant
and I had been discussed some Spark related issues before. Then I
realized that this is a pure Parquet issue, but forgot to change
the cc list. Thanks for pointing this out! Please ignore this thread.

Cheng

On 12/7/15 12:43 PM, Ted Yu wrote:

Cheng:

I only see user@spark in the CC.

FYI

On Sun, Dec 6, 2015 at 8:01 PM, Cheng Lian
> wrote:

cc parquet-dev list (it would be nice to always do so for
these general questions.)

Cheng

On 12/6/15 3:10 PM, Shushant Arora wrote:

Hi

I have few doubts on parquet file format.

1. Does Parquet keep min/max statistics like ORC does? How can I see the
Parquet version (whether it's 1.1, 1.2, or 1.3) for a Parquet file
generated using Hive, custom MR, or AvroParquetOutputFormat?

Yes, Parquet also keeps row group statistics. You may check
the Parquet file using the parquet-meta CLI tool in
parquet-tools (see
https://github.com/Parquet/parquet-mr/issues/321 for details),
then look for the "creator" field of the file. For
programmatic access, check for
o.a.p.hadoop.metadata.FileMetaData.createdBy.


2. How do I sort Parquet records while generating a Parquet file
using AvroParquetOutputFormat?

AvroParquetOutputFormat is not a format. It's just responsible
for converting Avro records to Parquet records. How are you
using AvroParquetOutputFormat? Any example snippets?


Thanks










bad performance on PySpark - big text file

2015-12-08 Thread patcharee

Hi,

I am very new to PySpark. I have a PySpark app working on text files of 
different sizes (100 MB - 100 GB). Each task handles the same size of input 
split, but workers spend much longer on some input splits, especially when 
the input splits belong to a big file. See the logs of these two input splits 
(check python.PythonRunner: Times: total ... )


15/12/08 07:37:15 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/budisansblog.blogspot.com.html:39728447488+134217728
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4335010, boot 
= -140, init = 282, finish = 4334868
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(125163) 
called with curMem=227636200, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_3_1772 stored as 
bytes in memory (estimated size 122.2 KB, free 3.8 GB)
15/12/08 08:49:30 INFO python.PythonRunner: Times: total = 4, boot = 1, 
init = 0, finish = 3
15/12/08 08:49:30 INFO storage.MemoryStore: ensureFreeSpace(126595) 
called with curMem=227761363, maxMem=4341293383
15/12/08 08:49:30 INFO storage.MemoryStore: Block rdd_9_1772 stored as 
bytes in memory (estimated size 123.6 KB, free 3.8 GB)
15/12/08 08:49:30 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/08 08:49:30 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/08 08:49:30 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512080849_0002_m_001772_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512080849_0002_m_001772
15/12/08 08:49:30 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512080849_0002_m_001772_0: Committed
15/12/08 08:49:30 INFO executor.Executor: Finished task 1772.0 in stage 
2.0 (TID 1770). 16216 bytes result sent to driver



15/12/07 20:52:24 INFO rdd.NewHadoopRDD: Input split: 
hdfs://helmhdfs/user/patcharee/ntap-raw-20151015-20151126/html2/bcnn1wp.wordpress.com.html:1476395008+134217728
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 41776, boot = 
-425, init = 432, finish = 41769
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1434614) 
called with curMem=167647961, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_3_994 stored as 
bytes in memory (estimated size 1401.0 KB, free 3.9 GB)
15/12/07 20:53:06 INFO python.PythonRunner: Times: total = 40, boot = 
-20, init = 21, finish = 39
15/12/07 20:53:06 INFO storage.MemoryStore: ensureFreeSpace(1463477) 
called with curMem=169082575, maxMem=4341293383
15/12/07 20:53:06 INFO storage.MemoryStore: Block rdd_9_994 stored as 
bytes in memory (estimated size 1429.2 KB, free 3.9 GB)
15/12/07 20:53:06 INFO output.FileOutputCommitter: File Output Committer 
Algorithm version is 1
15/12/07 20:53:06 INFO datasources.DynamicPartitionWriterContainer: 
Using output committer class 
org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
15/12/07 20:53:06 INFO output.FileOutputCommitter: Saved output of task 
'attempt_201512072053_0002_m_000994_0' to 
hdfs://helmhdfs/user/patcharee/NTAPBlogInfo/_temporary/0/task_201512072053_0002_m_000994
15/12/07 20:53:06 INFO mapred.SparkHadoopMapRedUtil: 
attempt_201512072053_0002_m_000994_0: Committed
15/12/07 20:53:06 INFO executor.Executor: Finished task 994.0 in stage 
2.0 (TID 990). 9386 bytes result sent to driver


Any suggestions please

Thanks,
Patcharee







Re: Spark with MapDB

2015-12-08 Thread Fengdong Yu
Can you detail your question? What do your previous batch and the 
current batch look like?





> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
> 
> Hi,
> 
> I'm running java over spark in cluster mode. I want to apply filter on 
> javaRDD based on some previous batch values. if i store those values in 
> mapDB, is it possible to apply filter during the current batch ?
> 
> Thanks,
> 
>   
> 



Re: Spark with MapDB

2015-12-08 Thread Jörn Franke
You may want to use a bloom filter for this, but make sure that you understand 
how it works

> On 08 Dec 2015, at 09:44, Ramkumar V  wrote:
> 
> Im running spark batch job in cluster mode every hour and it runs for 15 
> minutes. I have certain unique keys in the dataset. i dont want to process 
> those keys during my next hour batch.
> 
> Thanks,
> 
>  
> 
> 
>> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu  wrote:
>> Can you detail your question?  what looks like your previous batch and the 
>> current batch?
>> 
>> 
>> 
>> 
>> 
>>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm running java over spark in cluster mode. I want to apply filter on 
>>> javaRDD based on some previous batch values. if i store those values in 
>>> mapDB, is it possible to apply filter during the current batch ?
>>> 
>>> Thanks,
>>> 
>>>  
>>> 
> 


Logging spark output to hdfs file

2015-12-08 Thread sunil m
Hi!
I configured the log4j.properties file in Spark's conf folder with the following
values...

log4j.appender.file.File=hdfs://

I expected all logging output to go to the file in HDFS.
Instead, the files are created locally.

Has anybody tried logging to HDFS by configuring log4j.properties?

Warm regards,
Sunil M


Re: HiveContext creation failed with Kerberos

2015-12-08 Thread Steve Loughran



On 8 Dec 2015, at 06:52, Neal Yin 
> wrote:

15/12/08 04:12:28 ERROR transport.TSaslTransport: SASL negotiation failure
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: 
No valid credentials provided (Mechanism level: Failed to find any Kerberos 
tgt)]

lots of causes for that, it's one of the two classic "kerberos doesn't like you" 
error messages

https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/sections/errors.html

in your case it sounds like 1 or more of these issues

https://issues.apache.org/jira/browse/SPARK-10181
https://issues.apache.org/jira/browse/SPARK-11821
https://issues.apache.org/jira/browse/SPARK-11265

All of which are fixed in 1.5.3, which doesn't help you with the CDH 1.5.2 
release unless they backported things.

-Steve


Re: Logging spark output to hdfs file

2015-12-08 Thread Jörn Franke
This would require a special HDFS log4j appender. Alternatively, try the Flume 
log4j appender.
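
For illustration, a hedged sketch of what the Flume route might look like in Spark's 
conf/log4j.properties; the appender class and the Hostname/Port properties are the 
ones documented for Flume's log4j appender (the flume-ng-log4jappender jar must be on 
the classpath), and the host/port values are placeholders for your own Flume agent's 
Avro source.

# Sketch only: send the root logger to a Flume agent instead of a local file.
log4j.rootLogger=INFO, flume
log4j.appender.flume=org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname=flume-agent-host
log4j.appender.flume.Port=41414
# Don't fail the application if the Flume agent is unreachable (optional).
log4j.appender.flume.UnsafeMode=true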

> On 08 Dec 2015, at 13:00, sunil m <260885smanik...@gmail.com> wrote:
> 
> Hi!
> I configured log4j.properties file in conf folder of spark with following 
> values...
> 
> log4j.appender.file.File=hdfs://
> 
> I expected all log files to log output to the file in HDFS. 
> Instead files are created locally.
> 
> Has anybody tried logging to HDFS by configuring log4j.properties?
> 
> Warm regards,
> Sunil M




Comparisons between Ganglia and Graphite for monitoring the Streaming Cluster?

2015-12-08 Thread SRK
Hi,

What are the comparisons between Ganglia and Graphite to monitor the
Streaming Cluster? Which one has more advantages over the other?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Comparisons-between-Ganglia-and-Graphite-for-monitoring-the-Streaming-Cluster-tp25635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




is repartition very cost

2015-12-08 Thread Zhiliang Zhu

Hi All,

I need to optimize an objective function with some linear constraints using a
genetic algorithm. I would like to get as much parallelism for it as possible with
Spark. repartition / shuffle may be used in it sometimes; however, is the
repartition API very costly?

Thanks in advance!
Zhiliang



Re: Spark with MapDB

2015-12-08 Thread Ramkumar V
Yes, I agree, but the data is in the form of an RDD, and I'm also running in
cluster mode, so the data is distributed across all machines in the cluster.
If I use a bloom filter or MapDB, which are not distributed, how will that
work in this case?
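
For illustration only (not from the thread), a minimal Scala sketch of one way a 
non-distributed structure can still be used in cluster mode: build the filter once 
on the driver from the previous batch's keys and broadcast it, so every executor 
gets its own read-only copy. It assumes a recent Guava on the classpath, and the 
names keysFromPreviousBatch, currentBatchRdd and line.key are placeholders.

import com.google.common.hash.{BloomFilter, Funnels}
import java.nio.charset.StandardCharsets

// Build the filter on the driver from the previous batch's keys (placeholder collection).
val bloom = BloomFilter.create(
  Funnels.stringFunnel(StandardCharsets.UTF_8),
  1000000,   // expected number of keys
  0.01)      // acceptable false-positive rate
keysFromPreviousBatch.foreach(k => bloom.put(k))

// Ship one read-only copy to every executor.
val bloomBc = sc.broadcast(bloom)

// Keep only records whose key was (probably) not seen in the previous batch.
val filtered = currentBatchRdd.filter(line => !bloomBc.value.mightContain(line.key))

Note that a bloom filter can report false positives, so a small fraction of genuinely 
new keys may be dropped; if that is not acceptable, the broadcast Set approach from 
earlier in the thread is the safer choice.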

*Thanks*,



On Tue, Dec 8, 2015 at 5:30 PM, Jörn Franke  wrote:

> You may want to use a bloom filter for this, but make sure that you
> understand how it works
>
> On 08 Dec 2015, at 09:44, Ramkumar V  wrote:
>
> Im running spark batch job in cluster mode every hour and it runs for 15
> minutes. I have certain unique keys in the dataset. i dont want to process
> those keys during my next hour batch.
>
> *Thanks*,
> 
>
>
> On Tue, Dec 8, 2015 at 1:42 PM, Fengdong Yu 
> wrote:
>
>> Can you detail your question?  what looks like your previous batch and
>> the current batch?
>>
>>
>>
>>
>>
>> On Dec 8, 2015, at 3:52 PM, Ramkumar V  wrote:
>>
>> Hi,
>>
>> I'm running java over spark in cluster mode. I want to apply filter on
>> javaRDD based on some previous batch values. if i store those values in
>> mapDB, is it possible to apply filter during the current batch ?
>>
>> *Thanks*,
>> 
>>
>>
>>
>


PySpark reading from Postgres tables with UUIDs

2015-12-08 Thread Chris Elsmore
Hi All,

I'm currently having some issues getting Spark to read from Postgres tables 
that have uuid-type columns through a PySpark shell.

I can connect and see tables that do not have a uuid column, but I get the error 
"java.sql.SQLException: Unsupported type " when I try to get a table which 
does have a uuid column. Is there any way I can access these?
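
One workaround (hedged, not from the thread) is to cast the uuid column to text 
inside a subquery passed as the dbtable option, so the JDBC source only sees types 
it understands. Shown in Scala for brevity, but the same options apply from the 
PySpark sqlContext.read API; the table and column names below are made up.

val df = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")
  .option("driver", "org.postgresql.Driver")
  // hypothetical table "events" with a uuid primary key "id"
  .option("dbtable", "(SELECT id::text AS id, payload FROM events) AS events_casted")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()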

See the pastebin: http://pastebin.com/VbpU4uRU  
for more info and the PySpark shell readout.

Am using Postgres 9.4, Spark 1.5.1, Java OpenJDK 1.7.0_79, JDBC 
postgresql-9.4-1206-jdbc41

Chris

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

Should the gmond be installed in all the Spark nodes? What should the host
and port be? Should it be the host and port of gmetad?

 Enable GangliaSink for all instances 
*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink 
*.sink.ganglia.name=hadoop_cluster1 
*.sink.ganglia.host=localhost 
*.sink.ganglia.port=8653 
*.sink.ganglia.period=10 
*.sink.ganglia.unit=seconds 
*.sink.ganglia.ttl=1 
*.sink.ganglia.mode=multicast 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25636.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




flatMap function in Spark

2015-12-08 Thread Sateesh Karuturi
Guys... I am new to Spark.
Could anyone please explain how the flatMap function works, with a little
sample example?
Thanks in advance...
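
For reference, a tiny Scala illustration (not from the thread): flatMap maps each 
element to zero or more output elements and flattens the results, whereas map always 
produces exactly one output element per input.

val lines = sc.parallelize(Seq("hello world", "hi"))

val words = lines.flatMap(line => line.split(" "))
words.collect()   // Array(hello, world, hi)

val nested = lines.map(line => line.split(" "))
nested.collect()  // Array(Array(hello, world), Array(hi)) -- note the nesting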


Re: flatMap function in Spark

2015-12-08 Thread Gerard Maas
http://stackoverflow.com/search?q=%5Bapache-spark%5D+flatmap

-kr, Gerard.

On Tue, Dec 8, 2015 at 12:04 PM, Sateesh Karuturi <
sateesh.karutu...@gmail.com> wrote:

> Guys... I am new to Spark..
> Please anyone please explain me how flatMap function works with a little
> sample example...
> Thanks in advance...
>


Re: Spark SQL - saving to multiple partitions in parallel - FileNotFoundException on _temporary directory possible bug?

2015-12-08 Thread Jiří Syrový
Hi,

I have a very similar issue on a standalone SQL context, but when using
save() instead. I guess it might be related to
https://issues.apache.org/jira/browse/SPARK-8513. It also usually happens
after using groupBy.

Regards,
Jiri

2015-12-08 0:16 GMT+01:00 Deenar Toraskar :

> Hi
>
> I have a process that writes to multiple partitions of the same table in
> parallel using multiple threads sharing the same SQL context
> df.write.partitionBy("partCol").insertInto("tableName"). I am
> getting FileNotFoundException on _temporary directory. Each write only goes
> to a single partition, is there some way of not using dynamic partitioning
> using df.write without having to resort to .save as I dont want to hardcode
> a physical DFS location in my code?
>
> This is similar to this issue listed here
> https://issues.apache.org/jira/browse/SPARK-2984
>
> Regards
> Deenar
>
> *Think Reactive Ltd*
> deenar.toras...@thinkreactive.co.uk
>
>
>
>


Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Tao Li
I am using the Spark Streaming Kafka direct approach these days. I found that
when I start the application, it always starts consuming from the latest offset. I
would like the application, when it starts, to consume from the offset where the
last run of the same Kafka consumer group left off. Does that mean I have to
maintain the consumer offset myself, for example record it in ZooKeeper, and
reload the last offset from ZooKeeper when restarting the application?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offset ourselves, or will
Spark Streaming support a feature to simplify the offset-maintenance work?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html


RE: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Singh, Abhijeet
You need to maintain the offset yourself and rightly so in something like 
ZooKeeper.

From: Tao Li [mailto:litao.bupt...@gmail.com]
Sent: Tuesday, December 08, 2015 5:36 PM
To: user@spark.apache.org
Subject: Need to maintain the consumer offset by myself when using spark 
streaming kafka direct approach?

I am using spark streaming kafka direct approach these days. I found that when 
I start the application, it always start consumer the latest offset. I hope 
that when application start, it consume from the offset last application 
consumes with the same kafka consumer group. It means I have to maintain the 
consumer offset by my self, for example record it on zookeeper, and reload the 
last offset from zookeeper when restarting the applicaiton?

I see the following discussion:
https://github.com/apache/spark/pull/4805
https://issues.apache.org/jira/browse/SPARK-6249

Is there any conclusion? Do we need to maintain the offset by myself? Or spark 
streaming will support a feature to simplify the offset maintain work?

https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html


Re: understanding and disambiguating CPU-core related properties

2015-12-08 Thread Manolis Sifalakis1
Thanks a lot for the pointer! Helpful, even if a bit layman's in style.

(On the nagging end, this information, as usual with Spark, is not where it 
is expected to be: neither in the book nor in the Spark docs.)

m.



From:   Leonidas Patouchas 
To: Manolis Sifalakis1 
Cc: user@spark.apache.org
Date:   04/12/2015 18:03
Subject:Re: understanding and disambiguating CPU-core related 
properties



Regarding your 2nd question, there is a great article 
from Cloudera regarding this:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2
They focus on a YARN setup, but the big picture applies everywhere.
In general, I believe that you have to know your data in order to 
configure those sets of params a priori. From my experience, the more CPUs 
the merrier. I have noticed that, e.g., if I double the CPUs the job 
finishes in half the time. This effect, though, does not keep the same 
proportion (double CPUs - half the time) after I reach a specific number of CPUs 
(always depending on the data and the job's actions). So it takes a lot of 
trial and observation.
In addition, there is a tight connection between CPUs and partitions. 
Cloudera's article covers this.
Regards,
Leonidas

On Thu, Dec 3, 2015 at 5:44 PM, Manolis Sifalakis1  
wrote:
I have found the documentation rather poor in helping me understand the
interplay among the following properties in Spark, and even more so how to set
them. So this post is sent in the hope of some discussion and "enlightenment"
on the topic.

Let me start by asking whether I have understood the following correctly:

- spark.driver.cores:   how many cores the driver program should occupy
- spark.cores.max:   how many cores my app will claim for computations
- spark.executor.cores and spark.task.cpus:   how spark.cores.max is
allocated per JVM (executor) and per task (Java thread?)
  I.e. + spark.executor.cores:   each JVM instance (executor) should use
that many cores
+ spark.task.cpus: each task should occupy at most this # of cores

If so far so good, then...

q1: Is spark.cores.max inclusive or not of spark.driver.cores?

q2: How should one decide statically, a priori, how to distribute
spark.cores.max over JVMs and tasks?

q3: Since the set-up of cores-per-worker restricts how many cores can be
available per executor, and since an executor cannot span workers,
what is the rationale behind an application claiming cores
(spark.cores.max) as opposed to merely executors? (This would make an app
never fail to be admitted.)

TIA for any clarifications/intuitions/experiences on the topic

best

Manolis.
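
For illustration, a small Scala sketch of how these properties might be set together; 
the values are arbitrary, the property names are the standard Spark configuration 
keys, and the comments spell out the arithmetic that ties them together (assuming a 
standalone cluster).

val conf = new org.apache.spark.SparkConf()
  .setAppName("core-settings-example")
  .set("spark.driver.cores", "2")     // cores used by the driver process
  .set("spark.cores.max", "24")       // total cores the app claims across the cluster
  .set("spark.executor.cores", "4")   // cores per executor JVM -> at most 24 / 4 = 6 executors
  .set("spark.task.cpus", "1")        // cores reserved per task -> at most 4 concurrent tasks per executor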











Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread Dibyendu Bhattacharya
With the direct stream, the checkpoint location is not recoverable if you modify
your driver code. So if you rely only on checkpoints to commit offsets, you can
possibly lose messages if you modify the driver code and then start from the
"largest" offset. If you do not want to lose messages, you need to commit offsets
to an external store when using the direct stream.
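
For reference, a hedged Scala sketch (not from the thread) of how the offsets of each 
batch can be obtained from a direct stream and handed to your own store; stream stands 
for the DStream returned by KafkaUtils.createDirectStream, and saveOffsetToZk is a 
placeholder for whatever ZooKeeper/DB write you choose.

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // The direct stream exposes the Kafka offset range backing each RDD.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch first, then persist the offsets ...
  offsetRanges.foreach { o =>
    saveOffsetToZk(o.topic, o.partition, o.untilOffset)   // placeholder
  }
}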

On Tue, Dec 8, 2015 at 7:47 PM, PhuDuc Nguyen 
wrote:

> Kafka Receiver-based approach:
> This will maintain the consumer offsets in ZK for you.
>
> Kafka Direct approach:
> You can use checkpointing and that will maintain consumer offsets for you.
> You'll want to checkpoint to a highly available file system like HDFS or S3.
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing
>
> You don't have to maintain your own offsets if you don't want to. If the 2
> solutions above don't satisfy your requirements, then consider writing your
> own; otherwise I would recommend using the supported features in Spark.
>
> HTH,
> Duc
>
>
>
> On Tue, Dec 8, 2015 at 5:05 AM, Tao Li  wrote:
>
>> I am using spark streaming kafka direct approach these days. I found that
>> when I start the application, it always start consumer the latest offset. I
>> hope that when application start, it consume from the offset last
>> application consumes with the same kafka consumer group. It means I have to
>> maintain the consumer offset by my self, for example record it on
>> zookeeper, and reload the last offset from zookeeper when restarting the
>> applicaiton?
>>
>> I see the following discussion:
>> https://github.com/apache/spark/pull/4805
>> https://issues.apache.org/jira/browse/SPARK-6249
>>
>> Is there any conclusion? Do we need to maintain the offset by myself? Or
>> spark streaming will support a feature to simplify the offset maintain work?
>>
>>
>> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
>>
>
>


Re: hive thriftserver and fair scheduling

2015-12-08 Thread Deenar Toraskar
> Thanks Michael, I'll try it out. Another quick/important question: How do I
> make udfs available to all of the hive thriftserver users? Right now, when
> I launch a spark-sql client, I notice that it reads the ~/.hiverc file and
> all udfs get picked up but this doesn't seem to be working in hive
> thriftserver.
> Is there a way to make it work in a similar way for all users in hive
> thriftserver?

+1 for this request


On 20 October 2015 at 23:49, Sadhan Sood  wrote:

> Thanks Michael, I'll try it out. Another quick/important question: How do
> I make udfs available to all of the hive thriftserver users? Right now,
> when I launch a spark-sql client, I notice that it reads the ~/.hiverc file
> and all udfs get picked up but this doesn't seem to be working in hive
> thriftserver. Is there a way to make it work in a similar way for all users
> in hive thriftserver?
>
> Thanks again!
>
> On Tue, Oct 20, 2015 at 12:34 PM, Michael Armbrust  > wrote:
>
>> Not the most obvious place in the docs... but this is probably helpful:
>> https://spark.apache.org/docs/latest/sql-programming-guide.html#scheduling
>>
>> You likely want to put each user in their own pool.
>>
>> On Tue, Oct 20, 2015 at 11:55 AM, Sadhan Sood 
>> wrote:
>>
>>> Hi All,
>>>
>>> Does anyone have fair scheduling working for them in a hive server? I
>>> have one hive thriftserver running and multiple users trying to run queries
>>> at the same time on that server using a beeline client. I see that a big
>>> query is stopping all other queries from making any progress. Is this
>>> supposed to be this way? Is there anything else that I need to be doing for
>>> fair scheduling to be working for the thriftserver?
>>>
>>
>>
>


groupByKey()

2015-12-08 Thread Yasemin Kaya
Hi,

Sorry for the long inputs, but this is my situation.

I have two lists and I want to groupByKey them, but some values of the lists
disappear. I can't understand this.

(8867989628612931721,[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

(8867989628612931721,[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,* 1*, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

result of groupbykey
(8867989628612931721,[[1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 

epoch date time problem to load data into in spark

2015-12-08 Thread Soni spark
Hi Friends,

I have written a Spark Streaming program in Java to access Twitter tweets and
it is working fine. I am able to copy the Twitter feeds to an HDFS location
batch by batch. For each batch, it creates a folder with an epoch timestamp,
for example:

 If I give the HDFS location as *hdfs://localhost:54310/twitter/*, the files
are created like below:

*/spark/twitter/-144958080/*
*/spark/twitter/-144957984/*

I want to create folder names in yyyy-MM-dd-HH format instead of the
default epoch format.

I want it like below, so that I can do Hive partitions easily to access the
data:

*/spark/twitter/2015-12-08-01/*


Any one can help me. Thank you so much in advance.


Thanks
Soniya


Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-12-08 Thread Ewan Higgs

Sean,

Thanks.
It's a developer API and doesn't appear to be exposed.

Ewan

On 07/12/15 15:06, Sean Owen wrote:

I'm not sure if this is available in Python but from 1.3 on you should
be able to call ALS.setFinalRDDStorageLevel with level "none" to ask
it to unpersist when it is done.
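
For reference, a hedged Scala sketch of the knob Sean mentions; it lives on the 
Scala/Java MLlib ALS builder (a DeveloperApi), which is why, as Ewan notes, it is 
not reachable from the Python API. Here ratings is a placeholder RDD[Rating].

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.storage.StorageLevel

val als = new ALS()
  .setRank(10)
  .setIterations(10)
  .setImplicitPrefs(true)
  .setFinalRDDStorageLevel(StorageLevel.NONE)   // don't keep the factor RDDs cached
val model = als.run(ratings)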

On Mon, Dec 7, 2015 at 1:42 PM, Ewan Higgs  wrote:

Jonathan,
Did you ever get to the bottom of this? I have some users working with Spark
in a classroom setting and our example notebooks run into problems where
there is so much spilled to disk that they run out of quota. A 1.5G input
set becomes >30G of spilled data on disk. I looked into how I could
unpersist the data so I could clean up the files, but I was unsuccessful.

We're using Spark 1.5.0

Yours,
Ewan

On 16/07/15 23:18, Stahlman, Jonathan wrote:

Hello all,

I am running the Spark recommendation algorithm in MLlib and I have been
studying its output with various model configurations.  Ideally I would like
to be able to run one job that trains the recommendation model with many
different configurations to try to optimize for performance.  A sample code
in python is copied below.

The issue I have is that each new model which is trained caches a set of
RDDs and eventually the executors run out of memory.  Is there any way in
Pyspark to unpersist() these RDDs after each iteration?  The names of the
RDDs which I gather from the UI is:

itemInBlocks
itemOutBlocks
Products
ratingBlocks
userInBlocks
userOutBlocks
users

I am using Spark 1.3.  Thank you for any help!

Regards,
Jonathan




   data_train, data_cv, data_test = data.randomSplit([99,1,1], 2)
   functions = [rating] #defined elsewhere
   ranks = [10,20]
   iterations = [10,20]
   lambdas = [0.01,0.1]
   alphas  = [1.0,50.0]

   results = []
   for ratingFunction, rank, numIterations, m_lambda, m_alpha in
itertools.product( functions, ranks, iterations, lambdas, alphas ):
 #train model
 ratings_train = data_train.map(lambda l: Rating( l.user, l.product,
ratingFunction(l) ) )
 model   = ALS.trainImplicit( ratings_train, rank, numIterations,
lambda_=float(m_lambda), alpha=float(m_alpha) )

 #test performance on CV data
  ratings_cv = data_cv.map(lambda l: Rating( l.user, l.product,
ratingFunction(l) ) )
 auc = areaUnderCurve( ratings_cv, model.predictAll )

 #save results
 result = ",".join(str(l) for l in
[ratingFunction.__name__,rank,numIterations,m_lambda,m_alpha,auc])
 results.append(result)



The information contained in this e-mail is confidential and/or proprietary
to Capital One and/or its affiliates and may only be used solely in
performance of work or services for Capital One. The information transmitted
herewith is intended only for use by the individual or entity to which it is
addressed. If the reader of this message is not the intended recipient, you
are hereby notified that any review, retransmission, dissemination,
distribution, copying or other use of, or taking of any action in reliance
upon this information is strictly prohibited. If you have received this
communication in error, please contact the sender and delete the material
from your computer.








actors and async communication between driver and workers/executors

2015-12-08 Thread Manolis Sifalakis1
I've been looking around for some examples or information on how the 
driver and the executors can exchange information asynchronously, but have not 
found much apart from the ActorWordCount.scala streaming example that uses 
Akka.

Is there any "in-band" (within Spark) method by which such communication can 
be effected, or is out-of-band use of Akka the only bet? (Something equivalent 
in Python?) Ideally I do not want to have to send messages to IP 
addresses, just to worker and driver IDs that Spark may keep track of.

The problem at hand is the following: I would like the workers, while they 
number-crunch their RDD partition data, to asynchronously communicate some 
intermediate results back to the driver. The driver may then update and 
re-broadcast some new state variable back to all the workers, which they can 
use within the same stage of computation.

TIA

Manolis.





Re: Need to maintain the consumer offset by myself when using spark streaming kafka direct approach?

2015-12-08 Thread PhuDuc Nguyen
Kafka Receiver-based approach:
This will maintain the consumer offsets in ZK for you.

Kafka Direct approach:
You can use checkpointing and that will maintain consumer offsets for you.
You'll want to checkpoint to a highly available file system like HDFS or S3.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#checkpointing

You don't have to maintain your own offsets if you don't want to. If the 2
solutions above don't satisfy your requirements, then consider writing your
own; otherwise I would recommend using the supported features in Spark.

HTH,
Duc



On Tue, Dec 8, 2015 at 5:05 AM, Tao Li  wrote:

> I am using spark streaming kafka direct approach these days. I found that
> when I start the application, it always start consumer the latest offset. I
> hope that when application start, it consume from the offset last
> application consumes with the same kafka consumer group. It means I have to
> maintain the consumer offset by my self, for example record it on
> zookeeper, and reload the last offset from zookeeper when restarting the
> applicaiton?
>
> I see the following discussion:
> https://github.com/apache/spark/pull/4805
> https://issues.apache.org/jira/browse/SPARK-6249
>
> Is there any conclusion? Do we need to maintain the offset by myself? Or
> spark streaming will support a feature to simplify the offset maintain work?
>
>
> https://forums.databricks.com/questions/2936/need-to-maintain-the-consumer-offset-by-myself-whe.html
>


Re: epoch date format to normal date format while loading the files to HDFS

2015-12-08 Thread Andy Davidson
Hi Sonia,

I believe you are using Java? Take a look at Java's Date APIs; I am sure you will
find lots of examples of how to format dates.

Enjoy share

Andy


/**
 * saves tweets to disk. This is a replacement for
 * @param tweets
 * @param outputURI
 */
private static void saveTweets(JavaDStream<String> jsonTweets, String outputURI) {

    /*
    using saveAsTextFiles will cause lots of empty directories to be created.
    DStream<String> data = jsonTweets.dstream();
    data.saveAsTextFiles(outputURI, null);
    */

    jsonTweets.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        private static final long serialVersionUID = -5482893563183573691L;

        @Override
        public Void call(JavaRDD<String> rdd, Time time) throws Exception {
            if (!rdd.isEmpty()) {
                String dirPath = outputURI + "-" + time.milliseconds();
                rdd.saveAsTextFile(dirPath);
            }
            return null;
        }
    });
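
A hedged sketch (in Scala, not from the thread's code) of the folder-name change 
Soniya asks about: format the batch time with SimpleDateFormat instead of appending 
the raw milliseconds. The pattern and example path are illustrative.

import java.text.SimpleDateFormat
import java.util.Date

val fmt = new SimpleDateFormat("yyyy-MM-dd-HH")
def dirPathFor(outputURI: String, timeMs: Long): String =
  outputURI + "/" + fmt.format(new Date(timeMs))

// e.g. dirPathFor("hdfs://localhost:54310/twitter", time.milliseconds())
//      would give something like "hdfs://localhost:54310/twitter/2015-12-08-01"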




From:  Soni spark 
Date:  Tuesday, December 8, 2015 at 6:26 AM
To:  Andrew Davidson 
Subject:  epoch date format to normal date format while loading the files to
HDFS

> Hi Andy,
> 
> How are you? i need your help again.
> 
> I have written a spark streaming program in Java to access twitter tweets and
> it is working fine. I can able to copy the twitter feeds to HDFS location by
> batch wise.For  each batch, it is creating a folder with epoch time stamp. for
> example,
> 
>  If i give HDFS location as hdfs://localhost:54310/twitter/, the files are
> creating like below
> 
> /spark/twitter/-144958080/
> /spark/twitter/-144957984/
> 
> I want to create a folder name like -MM-dd-HH format instead of by default
> epoch format.
> 
> I want it like below so that i can do hive partitions easily to access the
> data.
> 
> /spark/twitter/2015-12-08-01/
> 
> 
> Can you help me. Thank you so much in advance.
> 
> 
> Thanks
> Soniya




SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Hello everyone,

I tried to run the example data-manipulation.R, and I can't get it to read
the flights.csv file that is stored in my local fs. I don't want to store
big files in my HDFS, so reading from a local fs (Lustre) is the desired
behavior for me.
I tried the following:

flightsDF <- read.df(sqlContext,
"file:///home/myuser/test_data/sparkR/flights.csv", source =
"com.databricks.spark.csv", header = "true")

I got the following messages and it eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2
MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at
org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang


Associating spark jobs with logs

2015-12-08 Thread sunil m
Hello Spark experts!

I was wondering if somebody has solved the problem which we are facing.

We want to achieve the following:

Given a Spark job ID, fetch all the logs generated by that job.

We looked at the Spark Job Server; it seems to be lacking such a feature.


Any ideas, suggestions are welcome!

Thanks in advance.

Warm regards,
Sunil M.


Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Deenar Toraskar
Hi Trystan

I am facing the same issue. It only appears with the thrift server; the
same call works fine via the spark-sql shell. Do you have any workarounds,
and have you filed a JIRA/bug for the same?

Regards
Deenar

On 12 October 2015 at 18:01, Trystan Leftwich  wrote:

> Hi everyone,
>
> Since upgrading to spark 1.5 I've been unable to create and use UDF's when
> we run in thrift server mode.
>
> Our setup:
> We start the thrift-server running against yarn in client mode, (we've
> also built our own spark from github branch-1.5 with the following args,
> -Pyarn -Phive -Phive-thrifeserver)
>
> if i run the following after connecting via JDBC (in this case via
> beeline):
>
> add jar 'hdfs://path/to/jar"
> (this command succeeds with no errors)
>
> CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';
> (this command succeeds with no errors)
>
> select testUDF(col1) from table1;
>
> I get the following error in the logs:
>
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
> pos 8
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
> at scala.Option.getOrElse(Option.scala:120)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
> at scala.util.Try.getOrElse(Try.scala:77)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
> at
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
> at
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
> .
> . (cutting the bulk for ease of email, more than happy to send the full
> output)
> .
> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive
> query:
> org.apache.hive.service.cli.HiveSQLException:
> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
> pos 100
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
>
>
> When I ran the same against 1.4 it worked.
>
> I've also changed the spark.sql.hive.metastore.version version to be 0.13
> (similar to what it was in 1.4) and 0.14 but I still get the same errors.
>
>
> Any suggestions?
>
> Thanks,
> Trystan
>
>


Re: NoSuchMethodError: com.fasterxml.jackson.databind.ObjectMapper.enable

2015-12-08 Thread Sunil Tripathy
Thanks Fengdong.


I still have the same exception.


Exception in thread "main" java.lang.NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
at 
com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)
at 
com.amazonaws.internal.config.InternalConfig$Factory.(InternalConfig.java:304)
at com.amazonaws.util.VersionInfoUtils.userAgent(VersionInfoUtils.java:139)
at 
com.amazonaws.util.VersionInfoUtils.initializeUserAgent(VersionInfoUtils.java:134)
at 
com.amazonaws.util.VersionInfoUtils.getUserAgent(VersionInfoUtils.java:95)
at com.amazonaws.ClientConfiguration.(ClientConfiguration.java:61)
at 
com.amazonaws.services.sqs.AmazonSQSClient.(AmazonSQSClient.java:168)
at sparktest.JavaSqsReceiver.(JavaSqsReceiver.scala:28)
at sparktest.StreamingTest.receiveMessage(StreamingTest.scala:18)
at sparktest.StreamingTest$.main(StreamingTest.scala:35)
at sparktest.StreamingTest.main(StreamingTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Will rebuilding spark help?


From: Fengdong Yu 
Sent: Monday, December 7, 2015 10:31 PM
To: Sunil Tripathy
Cc: user@spark.apache.org
Subject: Re: NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable

Can you try like this in your sbt:


val spark_version = "1.5.2"
val excludeServletApi = ExclusionRule(organization = "javax.servlet", artifact 
= "servlet-api")
val excludeEclipseJetty = ExclusionRule(organization = "org.eclipse.jetty")

libraryDependencies ++= Seq(
  "org.apache.spark" %%  "spark-sql"  % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty),
  "org.apache.spark" %%  "spark-hive" % spark_version % "provided" 
excludeAll(excludeServletApi, excludeEclipseJetty)
)



On Dec 8, 2015, at 2:26 PM, Sunil Tripathy 
> wrote:


I am getting the following exception when I use spark-submit to submit a spark 
streaming job.

Exception in thread "main" java.lang.NoSuchMethodError: 
com.fasterxml.jackson.databind.ObjectMapper.enable([Lcom/fasterxml/jackson/core/JsonParser$Feature;)Lcom/fasterxml/jackson/databind/ObjectMapper;
at 
com.amazonaws.internal.config.InternalConfig.(InternalConfig.java:43)

I tried with different versions of the jackson libraries but that does not seem to 
help.
 libraryDependencies += "com.fasterxml.jackson.core" % "jackson-databind" % 
"2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.6.3"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-annotations" % 
"2.6.3"

Any pointers to resolve the issue?

Thanks



Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hi,

Can anyone recommend a great graph visualization tool for GraphX that can 
handle truly large data (~ TB)?

Thanks so much
Hao



Re: Graph visualization tool for GraphX

2015-12-08 Thread Jörn Franke
I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization? You have to be more specific: what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualization, how many users, response time, web tool vs 
mobile vs desktop, etc.

> On 08 Dec 2015, at 16:46, Lin, Hao  wrote:
> 
> Hi,
>  
> Anyone can recommend a great Graph visualization tool for GraphX  that can 
> handle truly large Data (~ TB) ?
>  
> Thanks so much
> Hao


Re: can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread Ted Yu
Can you clarify your use case ?

Apart from hdfs, S3 (and possibly others) can be used.

Cheers

On Tue, Dec 8, 2015 at 9:40 AM, prateek arora 
wrote:

> Hi
>
> Is it possible into spark to write only RDD transformation into hdfs or any
> other storage system ?
>
> Regards
> Prateek
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Hello Jorn,

Thank you for the reply and for being tolerant of my oversimplified question; I 
should’ve been more specific. Though it is ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social network 
graph, like what can be done with Gephi, which has scalability limitations for 
such an amount of data. There will be dozens of users accessing it, and the 
response time is also critical. We would like to run the visualization tool on 
a remote EC2 server, so a web tool would be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization?? You have to be more specific, what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualisation, how many users, response time, web tool vs 
mobile vs Desktop etc

On 08 Dec 2015, at 16:46, Lin, Hao 
> wrote:
Hi,

Anyone can recommend a great Graph visualization tool for GraphX  that can 
handle truly large Data (~ TB) ?

Thanks so much
Hao


Merge rows into csv

2015-12-08 Thread Krishna
Hi,

what is the most efficient way to perform a group-by operation in Spark and
merge rows into csv?

Here is the current RDD
-
ID   STATE
-
1   TX
1NY
1FL
2CA
2OH
-

This is the required output:
-
IDCSV_STATE
-
1 TX,NY,FL
2 CA,OH
-
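
A minimal sketch of one way to do this on an RDD of (id, state) pairs (the tiny
sample data is illustrative); note that the order of states inside each CSV
string is not guaranteed unless you sort explicitly:

val pairs = sc.parallelize(Seq(("1", "TX"), ("1", "NY"), ("1", "FL"),
                               ("2", "CA"), ("2", "OH")))

val merged = pairs
  .groupByKey()                  // gather all states for one id together
  .mapValues(_.mkString(","))    // "TX,NY,FL"

merged.collect().foreach { case (id, csv) => println(s"$id\t$csv") }

groupByKey shuffles every value; for very large groups, aggregateByKey (building
the list or string incrementally) usually scales better.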


can i write only RDD transformation into hdfs or any other storage system

2015-12-08 Thread prateek arora
Hi

Is it possible in Spark to write only an RDD transformation into HDFS or any
other storage system?

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/can-i-write-only-RDD-transformation-into-hdfs-or-any-other-storage-system-tp25637.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Boyu Zhang
Thanks for the comment Felix, I tried giving
"/home/myuser/test_data/sparkR/flights.csv", but it tried to search the
path in hdfs, and gave errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on
org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not
exist: hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
at
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung 
wrote:

> Have you tried
>
> flightsDF <- read.df(sqlContext,
> "/home/myuser/test_data/sparkR/flights.csv", source =
> "com.databricks.spark.csv", header = "true")
>
>
>
> _
> From: Boyu Zhang 
> Sent: Tuesday, December 8, 2015 8:47 AM
> Subject: SparkR read.df failed to read file from local directory
> To: 
>
>
>
> Hello everyone,
>
> I tried to run the example data--manipulation.R, and can't get it to read
> the flights.csv file that is stored in my local fs. I don't want to store
> big files in my hdfs, so reading from a local fs (lustre fs) is the desired
> behavior for me.
>
> I tried the following:
>
> flightsDF <- read.df(sqlContext,
> "file:///home/myuser/test_data/sparkR/flights.csv", source =
> "com.databricks.spark.csv", header = "true")
>
> I got the message and eventually failed:
>
> 15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0
> in memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2
> MB)
> 15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 3.0 (TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException:
> File file:/home/myuser/test_data/sparkR/flights.csv does not exist
> at
> org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
>
> at
> org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
>
> at
> org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
> at
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
> at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> at java.lang.Thread.run(Thread.java:745)
>
> Can someone please provide comments? Any tips is appreciated, thank you!
>
> Boyu Zhang
>
>
>
>


Re: Graph visualization tool for GraphX

2015-12-08 Thread andy petrella
Hello Lin,

This is indeed a tough scenario when you have many vertices and (even
worse) many edges...

So two-fold answer:
First, technically, there is a graph plotting support in the spark notebook
(https://github.com/andypetrella/spark-notebook/ → check this notebook:
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb).
You can plot graph from scala, which will convert to D3 with force layout
force field.
The number of points which you will plot is limited by "sampling" them with a
`Sampler` that you can provide yourself, which leads to the second fold of
this answer.

Plotting a large graph is rather tough because there is no real notion of
dimension... there is always the option to dig the topological analysis
theory to find good homeomorphism ... but won't be that efficient ;-D.
Best is to find a good approach to generalize/summarize the information,
there are many many techniques (that you can find in mainly geospatial viz
and biology viz theories...)
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and
then plot a voronoi diagram (which can be approached using force layout).

In general terms I might say that multiscaling is intuitively what you want
first; this is an interesting paper presenting the foundations:
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf

Oh and BTW, to end this longish mail: while looking for new papers on that,
I came across this one:
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf which
is using
1. *Spark !!!*
2. a tile based approach (~ to tiling + pyramids in geospatial)

HTH

PS regarding the Spark Notebook, you can always come and discuss on gitter:
https://gitter.im/andypetrella/spark-notebook


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao  wrote:

> Hello Jorn,
>
>
>
> Thank you for the reply and being tolerant of my over simplified question.
> I should’ve been more specific.  Though ~TB of data, there will be about
> billions of records (edges) and 100,000 nodes. We need to visualize the
> social networks graph like what can be done by Gephi which has limitation
> on scalability to handle such amount of data. There will be dozens of users
> to access and the response time is also critical.  We would like to run the
> visualization tool on the remote ec2 server where webtool can be a good
> choice for us.
>
>
>
> Please let me know if I need to be more specific J.  Thanks
>
> hao
>
>
>
> *From:* Jörn Franke [mailto:jornfra...@gmail.com]
> *Sent:* Tuesday, December 08, 2015 11:31 AM
> *To:* Lin, Hao
> *Cc:* user@spark.apache.org
> *Subject:* Re: Graph visualization tool for GraphX
>
>
>
> I am not sure about your use case. How should a human interpret many
> terabytes of data in one large visualization?? You have to be more
> specific, what part of the data needs to be visualized, what kind of
> visualization, what navigation do you expect within the visualisation, how
> many users, response time, web tool vs mobile vs Desktop etc
>
>
> On 08 Dec 2015, at 16:46, Lin, Hao  wrote:
>
> Hi,
>
>
>
> Anyone can recommend a great Graph visualization tool for GraphX  that can
> handle truly large Data (~ TB) ?
>
>
>
> Thanks so much
>
> Hao
>
>
-- 
andy
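
For the "sample before you plot" part above, a rough GraphX sketch (the graph
value, the sample fraction and the output path are all placeholders, and the
sampling strategies discussed above are usually smarter than a uniform edge
sample):

import org.apache.spark.graphx._

// graph: Graph[String, Int] is assumed to exist already.
val sampledEdges = graph.edges.sample(withReplacement = false, fraction = 0.001)

sampledEdges
  .map(e => s"${e.srcId},${e.dstId},${e.attr}")   // simple CSV edge list
  .coalesce(1)                                    // one part file for the viz tool
  .saveAsTextFile("hdfs:///tmp/graph_sample_edges")

The resulting edge list is small enough to load into a desktop or web tool
(Gephi, D3, ...) while the full graph stays in Spark.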


Re: SparkR read.df failed to read file from local directory

2015-12-08 Thread Felix Cheung
Have you tried
flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", 
source = "com.databricks.spark.csv", header = "true")    



_
From: Boyu Zhang 
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To:  


Hello everyone,

I tried to run the example data--manipulation.R, and can't get it to read the
flights.csv file that is stored in my local fs. I don't want to store big files
in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for me.

I tried the following:

flightsDF <- read.df(sqlContext,
"file:///home/myuser/test_data/sparkR/flights.csv", source =
"com.databricks.spark.csv", header = "true")

I got the message and eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in
memory on hathi-a003.rcac.purdue.edu:33894 (size: 14.4 KB, free: 530.2 MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 9, hathi-a003.rcac.purdue.edu): java.io.FileNotFoundException: File
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang

Spark metrics not working

2015-12-08 Thread Jesse F Chen

v1.5.1.

Trying to enable CsvSink for metrics collection, but I get the following
error as soon as I kick off a 'spark-submit' app:


   15/12/08 11:24:02 INFO storage.BlockManagerMaster: Registered
   BlockManager
   15/12/08 11:24:02 ERROR metrics.MetricsSystem: Sink class
   org.apache.spark.metrics.sink.CsvSink cannot be instantialized
   15/12/08 11:24:02 ERROR spark.SparkContext: Error initializing
   SparkContext.
   java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
   Method)
   at sun.reflect.NativeConstructorAccessorImpl.newInstance
   (NativeConstructorAccessorImpl.java:57)
   at sun.reflect.DelegatingConstructorAccessorImpl.newInstance
   (DelegatingConstructorAccessorImpl.java:45)
   at java.lang.reflect.Constructor.newInstance
   (Constructor.java:526)

Only made one change -- edited my 'metrics.properties' file which now
contains the following settings:

   worker.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   master.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   executor.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
   driver.sink.csv.class=org.apache.spark.metrics.sink.CsvSink

   # Polling period for CsvSink
   *.sink.csv.period=5
   *.sink.csv.unit=second
   # Polling directory for CsvSink
   *.sink.csv.directory=/TestAutomation/results

According to the documentation, that's all I need to set:
http://spark.apache.org/docs/latest/monitoring.html

  Do I need to rebuild my spark distro with a special flag or something?

  Not much info on this on the Web at all.

  Ideas? Pointers? Thanks!
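
One thing worth checking (an assumption, not a confirmed diagnosis of the error
above): the metrics configuration has to be visible to every driver and executor
JVM, and by default Spark only reads $SPARK_HOME/conf/metrics.properties. A
hedged sketch of shipping a custom file with the job (paths are placeholders):

spark-submit \
  --files /path/to/metrics.properties \
  --conf spark.metrics.conf=metrics.properties \
  ... usual class/jar arguments ...

It is also worth checking that the *.sink.csv.directory path exists and is
writable on every host that runs an executor.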


Re: Associating spark jobs with logs

2015-12-08 Thread Ted Yu
Have you looked at the REST API section of:

https://spark.apache.org/docs/latest/monitoring.html

FYI

On Tue, Dec 8, 2015 at 8:57 AM, sunil m <260885smanik...@gmail.com> wrote:

> Hello Spark experts!
>
> I was wondering if somebody has solved the problem which we are facing.
>
> We want to achieve the following:
>
> Given a spark job id fetch all the logs generated by that job.
>
> We looked at spark job server it seems to be lacking such a feature.
>
>
> Any ideas, suggestions are welcome!
>
> Thanks in advance.
>
> Warm regards,
> Sunil M.
>


Re: About Spark On Hbase

2015-12-08 Thread censj
Can you give me an example?
I want to update the base data.
> On Dec 9, 2015, at 15:19, Fengdong Yu wrote:
> 
> https://github.com/nerdammer/spark-hbase-connector 
> 
> 
> This is better and easy to use.
> 
> 
> 
> 
> 
>> On Dec 9, 2015, at 3:04 PM, censj > > wrote:
>> 
>> hi all,
>>  now I using spark,but I not found spark operation hbase open source. Do 
>> any one tell me? 
>>  
> 
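
Not the connector API, but a hedged sketch of writing/updating HBase rows
straight from Spark with the stock HBase 1.x client (the table name, column
family/qualifier and the input RDD are assumptions):

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

// rdd: RDD[(String, String)] of (rowKey, newValue) pairs is assumed to exist.
rdd.foreachPartition { rows =>
  val conf = HBaseConfiguration.create()                  // picks up hbase-site.xml
  val conn = ConnectionFactory.createConnection(conf)     // one connection per partition
  val table = conn.getTable(TableName.valueOf("my_table"))
  try {
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)                                      // a Put overwrites, i.e. updates
    }
  } finally {
    table.close()
    conn.close()
  }
}

Connector libraries such as the ones linked in this thread wrap essentially this
pattern behind a friendlier API.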



Re: Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
I don't think it really needs the CDH component. Just use the API.



fightf...@163.com
 
From: censj
Sent: 2015-12-09 15:31
To: fightf...@163.com
Cc: user@spark.apache.org
Subject: Re: About Spark On Hbase
But this depends on CDH. I have not installed CDH.
On Dec 9, 2015, at 15:18, fightf...@163.com wrote:

Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
Also, HBASE-13992  already integrates that feature into the hbase side, but 
that feature has not been released. 

Best,
Sun.



fightf...@163.com
 
From: censj
Date: 2015-12-09 15:04
To: user@spark.apache.org
Subject: About Spark On Hbase
hi all,
 now I using spark,but I not found spark operation hbase open source. Do 
any one tell me? 



Re: About Spark On Hbase

2015-12-08 Thread censj
So, how do I get this jar? I use sbt package for my project, but I did not find this library in sbt.
> On Dec 9, 2015, at 15:42, fightf...@163.com wrote:
> 
> I don't think it really needs the CDH component. Just use the API.
> 
> fightf...@163.com 
>  
> From: censj 
> Sent: 2015-12-09 15:31
> To: fightf...@163.com 
> Cc: user@spark.apache.org 
> Subject: Re: About Spark On Hbase
> But this depends on CDH. I have not installed CDH.
>> On Dec 9, 2015, at 15:18, fightf...@163.com wrote:
>> 
>> Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
>>  
>> Also, HBASE-13992   
>> already integrates that feature into the hbase side, but 
>> that feature has not been released. 
>> 
>> Best,
>> Sun.
>> 
>> fightf...@163.com 
>>  
>> From: censj 
>> Date: 2015-12-09 15:04
>> To: user@spark.apache.org 
>> Subject: About Spark On Hbase
>> hi all,
>>  now I using spark,but I not found spark operation hbase open source. Do 
>> any one tell me? 



Re: is repartition very cost

2015-12-08 Thread Zhiliang Zhu
Thanks very much for Yong's help.
Sorry, one more question: must different partitions be on different nodes? That 
is, would each node have only one partition, in cluster mode?


On Wednesday, December 9, 2015 6:41 AM, "Young, Matthew T" 
 wrote:
 

Shuffling large amounts of data over the network is expensive, yes. The cost is
lower if you are just using a single node where no networking needs to be
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth
the shuffle time.

A common model is to repartition the data once after ingest to achieve
parallelism and avoid shuffles whenever possible later.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User
Subject: is repartition very cost

Hi All,

I need to do optimize objective function with some linear constraints by
genetic algorithm. I would like to make as much parallelism for it by spark.

repartition / shuffle may be used sometimes in it, however, is repartition API
very cost ?

Thanks in advance!
Zhiliang

Re: spark-defaults.conf optimal configuration

2015-12-08 Thread nsalian
Hi Chris,

Thank you for posting the question.
Tuning Spark configurations is a tricky task since there are a lot of factors
to consider.
The configurations that you listed cover most of them.

To understand the situation that can guide you in making a decision about
tuning:
1) What kind of spark applications are you intending to run?
2) What cluster manager have you decided to go with? 
3) How frequent are these applications going to run? (For the sake of
scheduling)
4) Is this used by multiple users? 
5) What else do you have in the cluster that will interact with Spark? (For
the sake of resolving dependencies)
Personally, I would suggest working through these questions prior to jumping
into tuning.
A cluster manager like YARN would help understand the settings for cores and
memory since the applications have to be considered for scheduling.

Hope that helps to start off in the right direction.





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641p25642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Local Mode: Executor thread leak?

2015-12-08 Thread Shixiong Zhu
Could you send a PR to fix it? Thanks!

Best Regards,
Shixiong Zhu

2015-12-08 13:31 GMT-08:00 Richard Marscher :

> Alright I was able to work through the problem.
>
> So the owning thread was one from the executor task launch worker, which
> at least in local mode runs the task and the related user code of the task.
> After judiciously naming every thread in the pools in the user code (with a
> custom ThreadFactory) I was able to trace down the leak to a couple thread
> pools that were not shut down properly by noticing the named threads
> accumulating in thread dumps of the JVM process.
>
> On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher  > wrote:
>
>> Thanks for the response.
>>
>> The version is Spark 1.5.2.
>>
>> Some examples of the thread names:
>>
>> pool-1061-thread-1
>> pool-1059-thread-1
>> pool-1638-thread-1
>>
>> There become hundreds then thousands of these stranded in WAITING.
>>
>> I added logging to try to track the lifecycle of the thread pool in
>> Executor as mentioned before. Here is an excerpt, but every seems fine
>> there. Every executor that starts is also shut down and it seems like it
>> shuts down fine.
>>
>> 15/12/07 23:30:21 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@e5d036b[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
>> 15/12/07 23:30:28 WARN o.a.s.e.Executor: Executor driver created, thread
>> pool: java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Running, pool
>> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> 15/12/07 23:31:06 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 36]
>> 15/12/07 23:31:11 WARN o.a.s.e.Executor: Executor driver created, thread
>> pool: java.util.concurrent.ThreadPoolExecutor@6e85ece4[Running, pool
>> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
>> 15/12/07 23:34:35 WARN o.a.s.e.Executor: Threads finished in executor
>> driver. pool shut down 
>> java.util.concurrent.ThreadPoolExecutor@6e85ece4[Terminated,
>> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
>>
>> Also here is an example thread dump of such a thread:
>>
>> "pool-493-thread-1" prio=10 tid=0x7f0e60612800 nid=0x18c4 waiting on
>> condition [0x7f0c33c3e000]
>>java.lang.Thread.State: WAITING (parking)
>> at sun.misc.Unsafe.park(Native Method)
>> - parking to wait for  <0x7f10b3e8fb60> (a
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>> at
>> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>> at
>> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>> at
>> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>> at
>> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> On Mon, Dec 7, 2015 at 6:23 PM, Shixiong Zhu  wrote:
>>
>>> Which version are you using? Could you post these thread names here?
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>>>
 Hi,

 I've been running benchmarks against Spark in local mode in a long
 running process. I'm seeing threads leaking each time it runs a job. It
 doesn't matter if I recycle SparkContext constantly or have 1 context stay
 alive for the entire application lifetime.

 I see a huge accumulation ongoing of "pool--thread-1" threads with
 the creating thread "Executor task launch worker-xx" where x's are numbers.
 The number of leaks per launch worker varies but usually 1 to a few.

 Searching the Spark code the pool is created in the Executor class. It
 is `.shutdown()` in the stop for the executor. I've wired up logging and
 also tried shutdownNow() and awaitForTermination on the pools. Every seems
 okay there for every Executor that is called with `stop()` but I'm still
 not sure yet if every Executor is called as such, which I am looking into
 now.

 What I'm curious to know is if anyone has seen a similar issue?

 --
 *Richard Marscher*
 Software Engineer
 Localytics
 Localytics.com  | Our Blog
  | Twitter 
  | Facebook  | LinkedIn
 

RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is 
lower if you are just using a single node where no networking needs to be 
involved to do the repartition (using Spark as a multithreading engine).

In general you need to do performance testing to see if a repartition is worth 
the shuffle time.

A common model is to repartition the data once after ingest to achieve 
parallelism and avoid shuffles whenever possible later.

From: Zhiliang Zhu [mailto:zchl.j...@yahoo.com.INVALID]
Sent: Tuesday, December 08, 2015 5:05 AM
To: User 
Subject: is repartition very cost


Hi All,

I need to do optimize objective function with some linear constraints by  
genetic algorithm.
I would like to make as much parallelism for it by spark.

repartition / shuffle may be used sometimes in it, however, is repartition API 
very cost ?

Thanks in advance!
Zhiliang
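
A tiny sketch of the "repartition once after ingest" pattern described above
(the input path and the factor of 4 are illustrative choices, not tuned values):

val raw = sc.textFile("hdfs:///data/input")                      // placeholder path
val working = raw.repartition(sc.defaultParallelism * 4).cache()

// subsequent narrow transformations now run in parallel with no further shuffles
val totalChars = working.map(_.length.toLong).reduce(_ + _)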




RE: Graph visualization tool for GraphX

2015-12-08 Thread Lin, Hao
Thanks Andy, I will certainly give your suggestion a try.

From: andy petrella [mailto:andy.petre...@gmail.com]
Sent: Tuesday, December 08, 2015 1:21 PM
To: Lin, Hao; Jörn Franke
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

Hello Lin,

This is indeed a tough scenario when you have many vertices (and even worst) 
many edges...

So two-fold answer:
First, technically, there is a graph plotting support in the spark notebook 
(https://github.com/andypetrella/spark-notebook/[github.com]
 → check this notebook: 
https://github.com/andypetrella/spark-notebook/blob/master/notebooks/viz/Graph%20Plots.snb[github.com]).
 You can plot graph from scala, which will convert to D3 with force layout 
force field.
The number or the points which you will plot are "sampled" using a `Sampler` 
that you can provide yourself. Which leads to the second fold of this answer.

Plotting a large graph is rather tough because there is no real notion of 
dimension... there is always the option to dig the topological analysis theory 
to find good homeomorphism ... but won't be that efficient ;-D.
Best is to find a good approach to generalize/summarize the information, there 
are many many techniques (that you can find in mainly geospatial viz and 
biology viz theories...)
Best is to check what will match your need the fastest.
There are quick techniques like using unsupervised clustering models and then 
plot a voronoi diagram (which can be approached using force layout).

In general term I might say that multiscaling is intuitively what you want 
first: this is an interesting paper presenting the foundations: 
https://www.cs.ubc.ca/~tmm/courses/533-07/readings/auberIV03Seattle.pdf[cs.ubc.ca]

Oh and BTW, to end this longish mail, while looking for new papers on that, I 
felt on this one: 
http://vacommunity.org/egas2015/papers/IGAS2015-ScottLangevin.pdf[vacommunity.org]
 which is using
1. Spark !!!
2. a tile based approach (~ to tiling + pyramids in geospatial)

HTH

PS regarding the Spark Notebook, you can always come and discuss on gitter: 
https://gitter.im/andypetrella/spark-notebook[gitter.im]


On Tue, Dec 8, 2015 at 6:30 PM Lin, Hao 
> wrote:
Hello Jorn,

Thank you for the reply and being tolerant of my over simplified question. I 
should’ve been more specific.  Though ~TB of data, there will be about billions 
of records (edges) and 100,000 nodes. We need to visualize the social networks 
graph like what can be done by Gephi which has limitation on scalability to 
handle such amount of data. There will be dozens of users to access and the 
response time is also critical.  We would like to run the visualization tool on 
the remote ec2 server where webtool can be a good choice for us.

Please let me know if I need to be more specific ☺.  Thanks
hao

From: Jörn Franke [mailto:jornfra...@gmail.com]
Sent: Tuesday, December 08, 2015 11:31 AM
To: Lin, Hao
Cc: user@spark.apache.org
Subject: Re: Graph visualization tool for GraphX

I am not sure about your use case. How should a human interpret many terabytes 
of data in one large visualization?? You have to be more specific, what part of 
the data needs to be visualized, what kind of visualization, what navigation do 
you expect within the visualisation, how many users, response time, web tool vs 
mobile vs Desktop etc

On 08 Dec 2015, at 16:46, Lin, Hao 
> wrote:
Hi,

Anyone can recommend a great Graph visualization tool for GraphX  that can 

RE: Executor metrics in spark application

2015-12-08 Thread SRK
Hi,

Were you able to setup custom metrics in GangliaSink? If so, how did you
register the custom metrics?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Executor-metrics-in-spark-application-tp188p25647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: SparkR read.df failed to read file from local directory

2015-12-08 Thread Sun, Rui
Hi, Boyu,

Does the local file “/home/myuser/test_data/sparkR/flights.csv” really exist?

I just tried, and had no problem creating a DataFrame from a local CSV file.

From: Boyu Zhang [mailto:boyuzhan...@gmail.com]
Sent: Wednesday, December 9, 2015 1:49 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: SparkR read.df failed to read file from local directory

Thanks for the comment Felix, I tried giving 
"/home/myuser/test_data/sparkR/flights.csv", but it tried to search the path in 
hdfs, and gave errors:

15/12/08 12:47:10 ERROR r.RBackendHandler: loadDF on 
org.apache.spark.sql.api.r.SQLUtils failed
Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
  org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://hostname:8020/home/myuser/test_data/sparkR/flights.csv
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
at org.apache.spark.rdd.RDD$$

Thanks,
Boyu

On Tue, Dec 8, 2015 at 12:38 PM, Felix Cheung 
> wrote:
Have you tried

flightsDF <- read.df(sqlContext, "/home/myuser/test_data/sparkR/flights.csv", 
source = "com.databricks.spark.csv", header = "true")


_
From: Boyu Zhang >
Sent: Tuesday, December 8, 2015 8:47 AM
Subject: SparkR read.df failed to read file from local directory
To: >


Hello everyone,

I tried to run the example data--manipulation.R, and can't get it to read the 
flights.csv file that is stored in my local fs. I don't want to store big files 
in my hdfs, so reading from a local fs (lustre fs) is the desired behavior for 
me.

I tried the following:

flightsDF <- read.df(sqlContext, 
"file:///home/myuser/test_data/sparkR/flights.csv",
 source = "com.databricks.spark.csv", header = "true")

I got the message and eventually failed:

15/12/08 11:42:41 INFO storage.BlockManagerInfo: Added broadcast_6_piece0 in 
memory on 
hathi-a003.rcac.purdue.edu:33894 
(size: 14.4 KB, free: 530.2 MB)
15/12/08 11:42:41 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 
(TID 9, hathi-a003.rcac.purdue.edu): 
java.io.FileNotFoundException: File 
file:/home/myuser/test_data/sparkR/flights.csv does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
at 
org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:398)
at 
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.(ChecksumFileSystem.java:137)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:339)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:763)
at org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:106)
at 
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:239)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Can someone please provide comments? Any tips is appreciated, thank you!

Boyu Zhang





Re: Associating spark jobs with logs

2015-12-08 Thread sunil m
Thanks for replying ...
Yes I did.
I am not seeing the application ids for jobs submitted to YARN when I
query http://MY_HOST:18080/api/v1/applications/

When I query
http://MY_HOST:18080/api/v1/applications/application_1446812769803_0011 it
does not understand the application id, since it belongs to YARN.

I am looking for a feature like this, but we need to get logs irrespective
of the master being YARN, Mesos or standalone Spark.

Warm regards,
Sunil M.

On 9 December 2015 at 00:48, Ted Yu  wrote:

> Have you looked at the REST API section of:
>
> https://spark.apache.org/docs/latest/monitoring.html
>
> FYI
>
> On Tue, Dec 8, 2015 at 8:57 AM, sunil m <260885smanik...@gmail.com> wrote:
>
>> Hello Spark experts!
>>
>> I was wondering if somebody has solved the problem which we are facing.
>>
>> We want to achieve the following:
>>
>> Given a spark job id fetch all the logs generated by that job.
>>
>> We looked at spark job server it seems to be lacking such a feature.
>>
>>
>> Any ideas, suggestions are welcome!
>>
>> Thanks in advance.
>>
>> Warm regards,
>> Sunil M.
>>
>
>


Re: About Spark On Hbase

2015-12-08 Thread fightf...@163.com
Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
Also, HBASE-13992  already integrates that feature into the hbase side, but 
that feature has not been released. 

Best,
Sun.



fightf...@163.com
 
From: censj
Date: 2015-12-09 15:04
To: user@spark.apache.org
Subject: About Spark On Hbase
hi all,
 now I using spark,but I not found spark operation hbase open source. Do 
any one tell me? 
 


Re: About Spark On Hbase

2015-12-08 Thread censj
But this depends on CDH. I have not installed CDH.
> On Dec 9, 2015, at 15:18, fightf...@163.com wrote:
> 
> Actually you can refer to https://github.com/cloudera-labs/SparkOnHBase 
>  
> Also, HBASE-13992   
> already integrates that feature into the hbase side, but 
> that feature has not been released. 
> 
> Best,
> Sun.
> 
> fightf...@163.com 
>  
> From: censj 
> Date: 2015-12-09 15:04
> To: user@spark.apache.org 
> Subject: About Spark On Hbase
> hi all,
>  now I using spark,but I not found spark operation hbase open source. Do 
> any one tell me? 



Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

Where does the *.sink.csv.directory directory get created? I cannot see any
metrics in the logs. How did you verify ConsoleSink and CsvSink?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25643.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Divya Gehlot
Hi,
As per requirement I need to use Spark 1.4.1, but HDP doesn't come with
Spark 1.4.1.
As instructed in this hortonworks page

I am able to set up Spark 1.4 in HDP, but when I run the spark shell it
shows the Spark 1.3.1 REPL instead of Spark 1.4.1.
Do I need to make any configuration changes apart from the instructions given
in the above-mentioned page?
How do I set Spark 1.4.1 as the default Spark engine?
Options:
1. Do I need to remove the current Spark 1.3.1 dir?
2. I named my Spark installation dir "spark 1.4.1"; do I need to rename it to
"Spark"?
3. Is there configuration that needs to change in HDP to set Spark 1.4.1 as
the default Spark engine?

Would really appreciate your help.

Thanks,
Regards,


Re: Spark metrics for ganglia

2015-12-08 Thread swetha kasireddy
Hi,

How to verify whether the GangliaSink directory got created?

Thanks,
Swetha

On Mon, Dec 15, 2014 at 11:29 AM, danilopds  wrote:

> Thanks tsingfu,
>
> I used this configuration based in your post: (with ganglia unicast mode)
> # Enable GangliaSink for all instances
> *.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
> *.sink.ganglia.host=10.0.0.7
> *.sink.ganglia.port=8649
> *.sink.ganglia.period=15
> *.sink.ganglia.unit=seconds
> *.sink.ganglia.ttl=1
> *.sink.ganglia.mode=unicast
>
> Then,
> I have the following error now.
> ERROR metrics.MetricsSystem: Sink class
> org.apache.spark.metrics.sink.GangliaSink  cannot be instantialized
> java.lang.ClassNotFoundException: org.apache.spark.metrics.sink.GangliaSink
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-metrics-for-ganglia-tp14335p20690.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to get custom metrics using Ganglia Sink?

2015-12-08 Thread SRK
Hi,

How do I configure custom metrics using Ganglia Sink?

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-custom-metrics-using-Ganglia-Sink-tp25645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: set up spark 1.4.1 as default spark engine in HDP 2.2/2.3

2015-12-08 Thread Saisai Shao
Please make sure the spark shell script you're running is pointed to
/bin/spark-shell

Just follow the instructions to correctly configure your Spark 1.4.1 and
execute the correct script; that should be enough.

On Wed, Dec 9, 2015 at 11:28 AM, Divya Gehlot 
wrote:

> Hi,
> As per requirement I need to use Spark 1.4.1.But HDP doesnt comes with
> Spark 1.4.1 version.
> As instructed in  this hortonworks page
> 
> I am able to set up Spark 1.4 in HDP ,but when I run the spark shell It
> shows Spark 1.3.1 REPL instead of spark 1.4.1 .
> Do I need to make any configuration changes apart from instructions given
> in above mentioned page.
> How do I set spark 1.4.1 as the default spark engine.
> Options :
> 1.Do I need to remove the current spark 1.3.1 dir.
> 2. I named my spark installation dir names spark 1.4.1,Do I need to rename
> as Spark.
> 3.Is there configuration needs to change  in HDP to set spark 1.4.1 as
> default spark engine .
>
> Would really appreciate your help.
>
> Thanks,
> Regards,
>


Re: About Spark On Hbase

2015-12-08 Thread Fengdong Yu
https://github.com/nerdammer/spark-hbase-connector

This is better and easy to use.





> On Dec 9, 2015, at 3:04 PM, censj  wrote:
> 
> hi all,
>  now I using spark,but I not found spark operation hbase open source. Do 
> any one tell me? 
>  



Re: Spark Java.lang.NullPointerException

2015-12-08 Thread michael_han
Hi Sarala,

Thanks for your reply. But it doesn't work.

I tried the following 2 commands:
*<1>*
spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar;c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

with error:
c:\spark-1.5.2-bin-hadoop2.6\bin>spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar;c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar
Warning: Local jar c:\spark-1.5.2-bin-hadoop2.6\bin\target\Spark-Test-1.0.jar;c:\spark-1.5.2-bin-hadoop2.6\lib\spark-assembly-1.5.2-hadoop2.6.0.jar does not exist, skipping.

*<2>*
spark-submit --master local --name "SparkTest App" --class com.qad.SparkTest1 target/Spark-Test-1.0.jar --jars c:/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar

with error as before: Spark Java.lang.NullPointerException






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Java-lang-NullPointerException-tp25562p25646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



About Spark On Hbase

2015-12-08 Thread censj
hi all,
Now I am using Spark, but I have not found an open source project for operating
on HBase from Spark. Can anyone point me to one?
 

Re: Can't create UDF's in spark 1.5 while running using the hive thrift service

2015-12-08 Thread Jeff Zhang
It is fixed in 1.5.3

https://issues.apache.org/jira/browse/SPARK-11191


On Wed, Dec 9, 2015 at 12:58 AM, Deenar Toraskar 
wrote:

> Hi Trystan
>
> I am facing the same issue. It only appears with the thrift server, the
> same call works fine via the spark-sql shell. Do you have any workarounds
> and have you filed a JIRA/bug for the same?
>
> Regards
> Deenar
>
> On 12 October 2015 at 18:01, Trystan Leftwich  wrote:
>
>> Hi everyone,
>>
>> Since upgrading to spark 1.5 I've been unable to create and use UDF's
>> when we run in thrift server mode.
>>
>> Our setup:
>> We start the thrift-server running against yarn in client mode, (we've
>> also built our own spark from github branch-1.5 with the following args,
>> -Pyarn -Phive -Phive-thrifeserver)
>>
>> if i run the following after connecting via JDBC (in this case via
>> beeline):
>>
>> add jar 'hdfs://path/to/jar"
>> (this command succeeds with no errors)
>>
>> CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';
>> (this command succeeds with no errors)
>>
>> select testUDF(col1) from table1;
>>
>> I get the following error in the logs:
>>
>> org.apache.spark.sql.AnalysisException: undefined function testUDF; line
>> 1 pos 8
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58)
>> at scala.Option.getOrElse(Option.scala:120)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53)
>> at scala.util.Try.getOrElse(Try.scala:77)
>> at
>> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506)
>> at
>> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505)
>> at
>> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>> at
>> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>> at
>> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>> .
>> . (cutting the bulk for ease of email, more than happy to send the full
>> output)
>> .
>> 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running
>> hive query:
>> org.apache.hive.service.cli.HiveSQLException:
>> org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1
>> pos 100
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at javax.security.auth.Subject.doAs(Subject.java:422)
>> at
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
>> at
>> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182)
>> at
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>> When I ran the same against 1.4 it worked.
>>
>> I've also changed the spark.sql.hive.metastore.version version to be 0.13
>> (similar to what it was in 1.4) and 0.14 but I still 

Re: Can not see any spark metrics on ganglia-web

2015-12-08 Thread SRK
Hi,

I cannot see any metrics either. How did you verify that ConsoleSink and
CsvSink work OK? Where does *.sink.csv.directory get created?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Can-not-see-any-spark-metrics-on-ganglia-web-tp14981p25644.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



How to use collections inside foreach block

2015-12-08 Thread Madabhattula Rajesh Kumar
Hi,

I have the below query. Please help me to solve this.

I have 2 ids. I want to join these ids to a table. This table contains
some blob data, so I cannot join these 2000 ids to this table in one step.

I'm planning to join this table in chunks. For example, in each step I will
join 5000 ids.

The below code is not working. I'm not able to add results to the ListBuffer;
the result size always comes out as ZERO.

*Code Block :-*

listOfIds is a ListBuffer with 2 records

listOfIds.grouped(5000).foreach { x =>
  var v1 = new ListBuffer[String]()
  val r = sc.parallelize(x).toDF()
  r.registerTempTable("r")
  var result = sqlContext.sql("SELECT r.id, t.data from r, t where r.id = t.id")
  result.foreach { y =>
    v1 += y
  }
  println(" SIZE OF V1 === " + v1.size)  ==> *THIS VALUE PRINTING AS ZERO*
  *// Save v1 values to other db*
}

Please help me on this.

Regards,
Rajesh
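
One likely explanation (an assumption, not verified against this exact job): the
closure passed to result.foreach runs on the executors, so it appends to
serialized copies of v1 and never touches the driver-side buffer that gets
printed. A sketch of a variant that brings each chunk back to the driver first,
reusing sc, sqlContext and listOfIds from the snippet above, and assuming the
ids are strings and each 5000-id chunk is small enough to collect:

import scala.collection.mutable.ListBuffer
import sqlContext.implicits._                 // assumed in scope for toDF

val collected = new ListBuffer[String]()

listOfIds.grouped(5000).foreach { chunk =>
  val r = sc.parallelize(chunk).toDF("id")
  r.registerTempTable("r")
  val result = sqlContext.sql("SELECT r.id, t.data FROM r, t WHERE r.id = t.id")
  val rows = result.collect()                 // Array[Row], materialized on the driver
  collected ++= rows.map(_.mkString(","))
  // ...save `rows` to the other DB here...
}

println("SIZE OF COLLECTED === " + collected.size)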


Re: Exception in Spark-sql insertIntoJDBC command

2015-12-08 Thread kali.tumm...@gmail.com
Hi All, 

I have the same error in Spark 1.5. Is there any solution to get around
this? I also tried using sourcedf.write.mode("append"), but still no luck.

val sourcedfmode=sourcedf.write.mode("append")
sourcedfmode.jdbc(TargetDBinfo.url,TargetDBinfo.table,targetprops)

Thanks
Sri 
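
For reference, a hedged sketch of the DataFrameWriter-based call shape in Spark
1.4/1.5, reusing sourcedf from above (URL, table and credentials are
placeholders; this illustrates the API, it is not a confirmed fix for the
exception):

import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "db_user")
props.setProperty("password", "db_password")
props.setProperty("driver", "com.mysql.jdbc.Driver")   // your DB's JDBC driver class

sourcedf.write
  .mode(SaveMode.Append)                               // append to an existing table
  .jdbc("jdbc:mysql://host:3306/mydb", "target_table", props)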



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Exception-in-Spark-sql-insertIntoJDBC-command-tp24655p25640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



spark-defaults.conf optimal configuration

2015-12-08 Thread cjrumble
I am seeking help with a Spark configuration running queries against a
cluster of 6 machines. Each machine has Spark 1.5.1 with slaves started on 6
and 1 acting as master/thriftserver. I query from Beeline 2 tables that have
300M and 31M rows respectively. Results from my queries thus far return up
to 500M rows when queried using Oracle but Spark errors at anything more
than 5.5M rows. 

I believe there is an optimal memory configuration that must be set for each
of the workers in our cluster but I have not been able to determine that
setting. Is there something better than trial and error? Are there settings
to avoid such as making sure not to set spark.driver.maxResultSize >
spark.driver.memory?

Is there a formula or are there guidelines by which to calculate the correct Spark
configuration values given a machine's available cores and memory
resources?

This is my current configuration:
BDA v3 server : SUN SERVER X4-2L
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores : 32
GB of memory (>=63): 63
number of disks : 12

spark-defaults.conf:

spark.driver.memory 20g
spark.executor.memory 40g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.rpc.askTimeout 6000s
spark.rpc.lookupTimeout 3000s
spark.driver.maxResultSize 20g
spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 600
spark.akka.frameSize 500
spark.shuffle.compress true
spark.shuffle.file.buffer 128k
spark.shuffle.memoryFraction 0
spark.shuffle.spill.compress true
spark.shuffle.spill true

Thank you,

Chris



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



INotifyDStream - where to find it?

2015-12-08 Thread octagon blue
Hi All,

I am using pyspark streaming to ETL raw data files as they land on HDFS.
While researching this topic I found this great presentation about Spark
and Spark Streaming at Uber
(http://www.slideshare.net/databricks/spark-meetup-at-uber), where they
mention this INotifyDStream that sounds very interesting and like it may
suit my use case well.

Does anyone know if this code has been submitted to apache, or how I
might otherwise come upon it? 

Reference: https://issues.apache.org/jira/browse/SPARK-10555 - Add
INotifyDStream to Spark Streaming

Thanks!

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Local Mode: Executor thread leak?

2015-12-08 Thread Richard Marscher
Alright, I was able to work through the problem.

So the owning thread was one from the executor task launch worker, which at
least in local mode runs the task and the related user code of the task.
After judiciously naming every thread in the pools in the user code (with a
custom ThreadFactory), I was able to trace the leak down to a couple of
thread pools that were not being shut down properly, by noticing the named
threads accumulating in thread dumps of the JVM process.
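
For anyone who hits something similar, this is roughly the kind of naming
ThreadFactory I mean; a minimal sketch, and the pool name "feature-loader" is
just an illustrative placeholder:

import java.util.concurrent.{ExecutorService, Executors, ThreadFactory}
import java.util.concurrent.atomic.AtomicInteger

// Gives every thread in a pool a recognizable name, so leaked threads are
// easy to spot in jstack output / thread dumps.
class NamedThreadFactory(poolName: String) extends ThreadFactory {
  private val counter = new AtomicInteger(0)
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, poolName + "-thread-" + counter.incrementAndGet())
    t.setDaemon(true)
    t
  }
}

object NamedPoolExample {
  def main(args: Array[String]): Unit = {
    // A pool created this way shows up as "feature-loader-thread-N" instead of
    // the anonymous "pool-N-thread-1", so a forgotten shutdown() is obvious.
    val pool: ExecutorService =
      Executors.newFixedThreadPool(4, new NamedThreadFactory("feature-loader"))
    try {
      pool.submit(new Runnable {
        override def run(): Unit = println(Thread.currentThread().getName)
      })
    } finally {
      pool.shutdown()
    }
  }
}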

On Mon, Dec 7, 2015 at 6:41 PM, Richard Marscher 
wrote:

> Thanks for the response.
>
> The version is Spark 1.5.2.
>
> Some examples of the thread names:
>
> pool-1061-thread-1
> pool-1059-thread-1
> pool-1638-thread-1
>
> There become hundreds then thousands of these stranded in WAITING.
>
> I added logging to try to track the lifecycle of the thread pool in
> Executor as mentioned before. Here is an excerpt, but everything seems fine
> there. Every executor that starts is also shut down and it seems like it
> shuts down fine.
>
> 15/12/07 23:30:21 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@e5d036b[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 1]
> 15/12/07 23:30:28 WARN o.a.s.e.Executor: Executor driver created, thread
> pool: java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Running, pool size
> = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 15/12/07 23:31:06 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@3bc41ae3[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 36]
> 15/12/07 23:31:11 WARN o.a.s.e.Executor: Executor driver created, thread
> pool: java.util.concurrent.ThreadPoolExecutor@6e85ece4[Running, pool size
> = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 15/12/07 23:34:35 WARN o.a.s.e.Executor: Threads finished in executor
> driver. pool shut down 
> java.util.concurrent.ThreadPoolExecutor@6e85ece4[Terminated,
> pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 288]
>
> Also here is an example thread dump of such a thread:
>
> "pool-493-thread-1" prio=10 tid=0x7f0e60612800 nid=0x18c4 waiting on
> condition [0x7f0c33c3e000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f10b3e8fb60> (a
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
> at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
> at
> java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
> at
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> On Mon, Dec 7, 2015 at 6:23 PM, Shixiong Zhu  wrote:
>
>> Which version are you using? Could you post these thread names here?
>>
>> Best Regards,
>> Shixiong Zhu
>>
>> 2015-12-07 14:30 GMT-08:00 Richard Marscher :
>>
>>> Hi,
>>>
>>> I've been running benchmarks against Spark in local mode in a long
>>> running process. I'm seeing threads leaking each time it runs a job. It
>>> doesn't matter if I recycle SparkContext constantly or have 1 context stay
>>> alive for the entire application lifetime.
>>>
>>> I see a huge ongoing accumulation of "pool--thread-1" threads with
>>> the creating thread "Executor task launch worker-xx" where x's are numbers.
>>> The number of leaks per launch worker varies but usually 1 to a few.
>>>
>>> Searching the Spark code the pool is created in the Executor class. It
>>> is `.shutdown()` in the stop for the executor. I've wired up logging and
>>> also tried shutdownNow() and awaitForTermination on the pools. Everything seems
>>> okay there for every Executor that is called with `stop()` but I'm still
>>> not sure yet if every Executor is called as such, which I am looking into
>>> now.
>>>
>>> What I'm curious to know is if anyone has seen a similar issue?
>>>
>>> --
>>> *Richard Marscher*
>>> Software Engineer
>>> Localytics
>>> Localytics.com  | Our Blog
>>>  | Twitter 
>>>  | Facebook  | LinkedIn
>>> 
>>>
>>
>>
>
>
> --
> *Richard Marscher*
> Software Engineer
> Localytics
> Localytics.com  | Our Blog
>  | Twitter  |
> Facebook 

Re: Merge rows into csv

2015-12-08 Thread ayan guha
reduceByKey would be a perfect fit for you
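
For example, here is a minimal sketch, assuming the input is (or can be
mapped to) an RDD of (id, state) pairs; note that the order of the merged
values within a key is not guaranteed, so sort first if the order matters:

import org.apache.spark.{SparkConf, SparkContext}

object MergeRowsToCsv {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("MergeRowsToCsv").setMaster("local[2]"))

    // Input matching the example table: (ID, STATE) pairs.
    val rows = sc.parallelize(Seq((1, "TX"), (1, "NY"), (1, "FL"), (2, "CA"), (2, "OH")))

    // Concatenate the states per key with a single shuffle.
    val merged = rows.reduceByKey((a, b) => a + "," + b)

    merged.collect().foreach { case (id, csv) => println(id + "," + csv) }
    // e.g.
    // 1,TX,NY,FL
    // 2,CA,OH

    sc.stop()
  }
}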

On Wed, Dec 9, 2015 at 4:47 AM, Krishna  wrote:

> Hi,
>
> what is the most efficient way to perform a group-by operation in Spark
> and merge rows into csv?
>
> Here is the current RDD:
> -------------
> ID     STATE
> -------------
> 1      TX
> 1      NY
> 1      FL
> 2      CA
> 2      OH
> -------------
>
> This is the required output:
> -----------------
> ID     CSV_STATE
> -----------------
> 1      TX,NY,FL
> 2      CA,OH
> -----------------
>



-- 
Best Regards,
Ayan Guha