How to do dispatching in Streaming?

2015-04-12 Thread Jianshi Huang
Hi,

I have a Kafka topic that contains dozens of different types of messages,
and for each type I'll need to create a separate DStream.

Currently I have to filter the Kafka stream over and over, which is very
inefficient.

So what's the best way to do dispatching in Spark Streaming? (one DStream
-> multiple DStreams)


Thanks,
-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github  Blog: http://huangjs.github.com/
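One possible approach to the dispatching question above (a sketch, not a definitive answer): consume the topic once, extract the message type a single time, cache the keyed stream, and derive the per-type DStreams from that shared stream. The sketch below assumes the receiver-based KafkaUtils.createStream from the spark-streaming-kafka artifact; parseType, the topic name, and the message types are hypothetical placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.kafka.KafkaUtils

object Dispatch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("dispatch"), Seconds(10))

    // Consume the mixed topic once; each record is (key, value).
    val raw = KafkaUtils.createStream(ssc, "zkhost:2181", "dispatch-group", Map("mixed-topic" -> 1))

    // Hypothetical parser that extracts the message type from the payload.
    def parseType(value: String): String = value.takeWhile(_ != '|')

    // Key every message by its type so the parsing work is done only once.
    val typed: DStream[(String, String)] = raw.map { case (_, value) => (parseType(value), value) }
    typed.cache()  // share the parsed batches across all derived streams

    // Derive one DStream per message type from the shared, cached stream.
    val messageTypes = Seq("click", "purchase", "login")
    val perType: Map[String, DStream[String]] =
      messageTypes.map(t => t -> typed.filter(_._1 == t).map(_._2)).toMap

    perType("click").count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

Each derived stream still applies a filter per type, but only over the cached, already-parsed batches, rather than re-parsing every record for every type.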


counters in spark

2015-04-12 Thread Grandl Robert
Hi guys,
I was trying to figure out whether Spark exposes counters for the amount of CPU
or memory used (in some metric) by a task/stage/job, but I could not find any.
Is there any such counter available?
Thank you,
Robert
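One place to look, as a rough sketch rather than a definitive answer: task-level metrics can be observed with a SparkListener registered on the SparkContext. The fields below follow the 1.x TaskMetrics API (run time and spilled bytes; pure CPU time is not exposed there), so treat this as illustrative only.

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Logs a few per-task metrics as tasks finish.
// Register it with: sc.addSparkListener(new TaskMetricsListener)
class TaskMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      println(s"stage ${taskEnd.stageId}: " +
        s"run time = ${m.executorRunTime} ms, " +
        s"memory spilled = ${m.memoryBytesSpilled} bytes, " +
        s"disk spilled = ${m.diskBytesSpilled} bytes")
    }
  }
}

Similar per-task and per-stage numbers also surface in the web UI's stage pages and through the metrics system configured via metrics.properties.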





Re: How to use Joda Time with Spark SQL?

2015-04-12 Thread Justin Yip
Cheng, this is great info. I have a follow-up question. There are a few
very common data types (e.g. Joda DateTime) that are not directly supported
by SparkSQL. Do you know if there are any plans for accommodating some
common data types in SparkSQL? They don't need to be first-class
datatypes, but if they were available as UDTs provided by the SparkSQL
library, that would make DataFrame users' lives easier.

Justin

On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote:

 One possible approach is defining a UDT (user-defined type) for Joda
 time. A UDT maps an arbitrary type to and from Spark SQL data types. You
 may check ExamplePointUDT [1] for more details.

 [1]: https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala
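For reference, a rough sketch of what such a UDT could look like, modeled on ExamplePointUDT and storing the DateTime as epoch millis. UserDefinedType is a @DeveloperApi whose serialize/deserialize signatures have changed between Spark versions; this follows the 1.x shape and is illustrative only.

import org.apache.spark.sql.types._
import org.joda.time.DateTime

// Illustrative only: maps org.joda.time.DateTime to a LongType column (epoch millis).
class JodaDateTimeUDT extends UserDefinedType[DateTime] {
  override def sqlType: DataType = LongType

  override def serialize(obj: Any): Any = obj match {
    case dt: DateTime => dt.getMillis
  }

  override def deserialize(datum: Any): DateTime = datum match {
    case millis: Long => new DateTime(millis)
  }

  override def userClass: Class[DateTime] = classOf[DateTime]
}

The awkward part for a third-party class like Joda's DateTime is registration: the @SQLUserDefinedType annotation normally goes on the user class itself, which you can't add to a library type.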


 On 4/8/15 6:09 AM, adamgerst wrote:

 I've been using Joda Time in all my Spark jobs (by using the nscala-time
 package) and have not run into any issues until I started trying to use
 Spark SQL. When I try to convert a case class that has a
 com.github.nscala_time.time.Imports.DateTime object in it, an exception
 is thrown with a MatchError.

 My assumption is that this is because the basic types of Spark SQL are
 java.sql.Timestamp and java.sql.Date, and therefore Spark doesn't know
 what to do about the DateTime value.

 How can I get around this? I would prefer not to have to change my code to
 make the values be Timestamps, but I'm concerned that might be the only
 way. Would something like implicit conversions work here?

 It seems that even if I specify the schema manually I would still have
 the issue, since you have to specify the column type, which has to be of
 type org.apache.spark.sql.types.DataType.











Re: Spark TeraSort source request

2015-04-12 Thread Ewan Higgs

Hi all.
The code is linked from my repo:

https://github.com/ehiggs/spark-terasort

This is an example Spark program for running TeraSort benchmarks. It is 
based on work from Reynold Xin's branch 
(https://github.com/rxin/spark/tree/terasort), but it is not the same 
TeraSort program that currently holds the record (http://sortbenchmark.org/). 
That program is here: 
https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort




I've been working on other projects at the moment, so I haven't returned 
to the spark-terasort stuff. If you have any pull requests, I would be 
very grateful.


Yours,
Ewan

On 08/04/15 03:26, Pramod Biligiri wrote:

+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom thubregt...@gmail.com wrote:


Hi all,

As we all know, Spark has set the record for sorting data, as
published on:
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html.

Here at our group, we would love to verify these results, and compare
machines using this benchmark. We've spent quite some time trying to find
the TeraSort source code that was used, but cannot find it anywhere.

We did find two candidates:

A version posted by Reynold [1], the poster of the message above. This
version is stuck at // TODO: Add partition-local (external) sorting
using TeraSortRecordOrdering, and only generates data.

Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort
[2]. After this he created a version of his own [3]. With this version, we
noticed problems with TeraValidate on datasets above ~10G (as mentioned by
others at [4]). When examining the raw input and output files, it actually
appears that the input data is sorted and the output data unsorted in both
cases.

Because of this, we believe we have not yet found the source code that was
actually used. I've tried searching the Spark user forum archives, and saw
requests from other people indicating a demand, but did not succeed in
finding the actual source code.

My question:
Could you guys please make the source code of the TeraSort program that was
used, preferably with settings, available? If not, what are the reasons
that this seems to be withheld?

Thanks for any help,

Tom Hubregtsen

[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E









RE: How to use Joda Time with Spark SQL?

2015-04-12 Thread Wang, Daoyuan
Actually, I did a little investigation on Joda Time when I was working on 
SPARK-4987 for Timestamp ser-de in Parquet format. I think Joda natively offers 
an interface for converting to and from Java objects.

For example, to transform a java.util.Date (parent of java.sql.Date and 
java.sql.Timestamp) object named jd, in Java code you can use
DateTime dt = new DateTime(jd);
Or in Scala code
val dt: DateTime = new DateTime(jd)

On the other hand, given a DateTime object named dt, you can use code like
val jd: java.sql.Timestamp = new Timestamp(dt.getMillis)
to get the Java object.

Thanks,
Daoyuan.

From: Cheng Lian [mailto:lian.cs@gmail.com]
Sent: Sunday, April 12, 2015 11:51 PM
To: Justin Yip
Cc: adamgerst; user@spark.apache.org
Subject: Re: How to use Joda Time with Spark SQL?

These common UDTs can always be wrapped in libraries and published to 
spark-packages (http://spark-packages.org/) :-)

Cheng
On 4/12/15 3:00 PM, Justin Yip wrote:
Cheng, this is great info. I have a follow-up question. There are a few very 
common data types (e.g. Joda DateTime) that are not directly supported by 
SparkSQL. Do you know if there are any plans for accommodating some common data 
types in SparkSQL? They don't need to be first-class datatypes, but if they 
were available as UDTs provided by the SparkSQL library, that would make 
DataFrame users' lives easier.

Justin

On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote:
One possible approach is defining a UDT (user-defined type) for Joda time. 
A UDT maps an arbitrary type to and from Spark SQL data types. You may check 
ExamplePointUDT [1] for more details.

[1]: 
https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala


On 4/8/15 6:09 AM, adamgerst wrote:
I've been using Joda Time in all my Spark jobs (by using the nscala-time
package) and have not run into any issues until I started trying to use
Spark SQL. When I try to convert a case class that has a
com.github.nscala_time.time.Imports.DateTime object in it, an exception is
thrown with a MatchError.

My assumption is that this is because the basic types of Spark SQL are
java.sql.Timestamp and java.sql.Date, and therefore Spark doesn't know what to
do about the DateTime value.

How can I get around this? I would prefer not to have to change my code to
make the values be Timestamps but I'm concerned that might be the only way.
Would something like implicit conversions work here?

It seems that even if I specify the schema manually then I would still have
the issue since you have to specify the column type which has to be of type
org.apache.spark.sql.types.DataType










Re: How to use Joda Time with Spark SQL?

2015-04-12 Thread Cheng Lian
These common UDTs can always be wrapped in libraries and published to 
spark-packages (http://spark-packages.org/) :-)


Cheng

On 4/12/15 3:00 PM, Justin Yip wrote:
Cheng, this is great info. I have a follow-up question. There are a 
few very common data types (e.g. Joda DateTime) that are not directly 
supported by SparkSQL. Do you know if there are any plans for 
accommodating some common data types in SparkSQL? They don't need to 
be first-class datatypes, but if they were available as UDTs provided 
by the SparkSQL library, that would make DataFrame users' lives easier.


Justin

On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote:


One possible approach is defining a UDT (user-defined type)
for Joda time. A UDT maps an arbitrary type to and from Spark SQL
data types. You may check ExamplePointUDT [1] for more details.

[1]:

https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala



On 4/8/15 6:09 AM, adamgerst wrote:

I've been using Joda Time in all my Spark jobs (by using the nscala-time
package) and have not run into any issues until I started trying to use
Spark SQL. When I try to convert a case class that has a
com.github.nscala_time.time.Imports.DateTime object in it, an exception is
thrown with a MatchError.

My assumption is that this is because the basic types of Spark SQL are
java.sql.Timestamp and java.sql.Date, and therefore Spark doesn't know what
to do about the DateTime value.

How can I get around this? I would prefer not to have to change my code to
make the values be Timestamps, but I'm concerned that might be the only way.
Would something like implicit conversions work here?

It seems that even if I specify the schema manually I would still have the
issue, since you have to specify the column type, which has to be of type
org.apache.spark.sql.types.DataType.













Re: function to convert to pair

2015-04-12 Thread Jeetendra Gangele
I have to create some kind of index from my JavaRDD<Object>; it should be
something like JavaPairRDD<uniqueIndex, Object>,

but zipWithIndex gives <Object, Long>. Later I need to use this RDD for a
join, so it looks like it won't work for me.
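If the only problem is the order of the pair, one possible approach (a sketch, not a definitive answer) is to swap the result of zipWithIndex so the index becomes the key; the Java API can do the same with mapToPair. A minimal Scala sketch, shifting to 1-based numbering to match the requirement above:

import org.apache.spark.{SparkConf, SparkContext}

object ZipWithIndexSwap {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("zip-with-index-swap").setMaster("local[*]"))

    val records = sc.parallelize(Seq("first", "second", "third"))

    // zipWithIndex yields (element, index) with 0-based indices assigned in
    // partition order, so the original ordering is preserved; swap the pair
    // and shift by one to get (1, first), (2, second), ...
    val indexed = records.zipWithIndex().map { case (value, idx) => (idx + 1, value) }

    indexed.collect().foreach(println)

    sc.stop()
  }
}

The resulting (Long, Object) pairs can then serve as keys for lookups and joins later in the workflow.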

On 9 April 2015 at 04:17, Ted Yu yuzhih...@gmail.com wrote:

 Please take a look at zipWithIndex() of RDD.

 Cheers

 On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi all, I have an RDD<SomeObject> and I want to convert it to
 RDD<sequenceNumber, SomeObject>; this sequence number can be 1 for the first
 SomeObject, 2 for the second SomeObject, and so on.


 Regards
 jeet





Re: regarding ZipWithIndex

2015-04-12 Thread Ted Yu
bq. will return something like JavaPairRDD<Object, Long>

The Long component of the pair fits your description of an index. What other
requirement does zipWithIndex not provide for you?

Cheers

On Sun, Apr 12, 2015 at 1:16 PM, Jeetendra Gangele gangele...@gmail.com
wrote:

 Hi all, I have a JavaRDD<Object> and I want to convert it to
 JavaPairRDD<Index, Object>. The index should be unique and it should maintain
 the order: the first object should have 1, the second 2, and so on.

 I tried using zipWithIndex but it will return something like
 JavaPairRDD<Object, Long>.
 I want to use this RDD for lookup and join operations later in my
 workflow, so ordering is important.


 Regards
 jeet



regarding ZipWithIndex

2015-04-12 Thread Jeetendra Gangele
Hi all, I have a JavaRDD<Object> and I want to convert it to
JavaPairRDD<Index, Object>. The index should be unique and it should maintain
the order: the first object should have 1, the second 2, and so on.

I tried using zipWithIndex but it will return something like
JavaPairRDD<Object, Long>.
I want to use this RDD for lookup and join operations later in my workflow,
so ordering is important.


Regards
jeet