How to do dispatching in Streaming?
Hi,

I have a Kafka topic that contains dozens of different types of messages, and I need to create a separate DStream for each type. Currently I have to filter the same Kafka stream over and over, which is very inefficient. So what's the best way to do dispatching in Spark Streaming (one DStream to multiple DStreams)?

Thanks,

--
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/
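A minimal sketch of one common pattern, in Scala: persist each batch once and derive the per-type streams by filtering the cached stream, so Kafka is read once per interval instead of once per filter. The Message type and msgType field are illustrative assumptions, not from this thread:

    import org.apache.spark.streaming.dstream.DStream

    case class Message(msgType: String, payload: String)

    // Split one input stream into one DStream per message type.
    // cache() keeps each batch RDD in memory, so the N filters below
    // scan the cached data instead of re-reading Kafka N times.
    def dispatch(stream: DStream[Message],
                 types: Seq[String]): Map[String, DStream[Message]] = {
      stream.cache()
      types.map(t => t -> stream.filter(_.msgType == t)).toMap
    }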
Counters in Spark
Hi guys,

I was trying to find counters in Spark for the amount of CPU or memory (in some metric) used by a task, stage, or job, but I could not find any. Is there any such counter available?

Thank you,
Robert
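One place such numbers do surface is the listener API. A hedged sketch, not from this thread, and with field availability varying across Spark versions, that logs a couple of per-task metrics as tasks finish:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class TaskMetricsListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) { // metrics can be absent, e.g. for failed tasks
          println(s"stage=${taskEnd.stageId} task=${taskEnd.taskInfo.taskId} " +
            s"runTimeMs=${m.executorRunTime} spilledBytes=${m.memoryBytesSpilled}")
        }
      }
    }

    // Register on an existing SparkContext:
    // sc.addSparkListener(new TaskMetricsListener())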
Re: How to use Joda Time with Spark SQL?
Cheng, this is great info. I have a follow-up question.

A few very common data types (e.g. Joda DateTime) are not directly supported by Spark SQL. Do you know if there are any plans to accommodate some common data types in Spark SQL? They don't need to be first-class data types, but if they were available as UDTs provided by the Spark SQL library, that would make DataFrame users' lives easier.

Justin

On Sat, Apr 11, 2015 at 5:41 AM, Cheng Lian lian.cs@gmail.com wrote:

One possible approach is defining a UDT (user-defined type) for Joda time. A UDT maps an arbitrary type to and from Spark SQL data types. You may check ExamplePointUDT [1] for more details.

[1]: https://github.com/apache/spark/blob/694aef0d71d2683eaf63cbd1d8e95c2da423b72e/sql/core/src/main/scala/org/apache/spark/sql/test/ExamplePointUDT.scala

On 4/8/15 6:09 AM, adamgerst wrote:

I've been using Joda Time in all my Spark jobs (via the nscala-time package) and had not run into any issues until I started trying to use Spark SQL. When I try to convert a case class that has a com.github.nscala_time.time.Imports.DateTime field in it, a MatchError is thrown. My assumption is that this is because the basic temporal types of Spark SQL are java.sql.Timestamp and java.sql.Date, and therefore Spark doesn't know what to do with the DateTime value. How can I get around this? I would prefer not to change my code to make the values Timestamps, but I'm concerned that might be the only way. Would something like implicit conversions work here? It seems that even if I specify the schema manually I would still have the issue, since you have to specify the column type, which has to be of type org.apache.spark.sql.types.DataType.
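For reference, a minimal sketch of what such a UDT could look like, modeled on ExamplePointUDT; the exact UserDefinedType API is version-dependent, so this is an illustration under that assumption rather than tested code:

    import org.apache.joda.time.DateTime
    import org.apache.spark.sql.types._

    // Maps Joda DateTime to a LongType column holding epoch millis.
    // Note that any timezone information on the DateTime is lost.
    class JodaDateTimeUDT extends UserDefinedType[DateTime] {
      override def sqlType: DataType = LongType

      override def serialize(obj: Any): Any = obj match {
        case dt: DateTime => dt.getMillis
      }

      override def deserialize(datum: Any): DateTime = datum match {
        case millis: Long => new DateTime(millis)
      }

      override def userClass: Class[DateTime] = classOf[DateTime]
    }

The case class field would then be annotated with @SQLUserDefinedType(udt = classOf[JodaDateTimeUDT]), following the ExamplePointUDT pattern.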
Re: Spark TeraSort source request
Hi all,

The code is linked from my repo: https://github.com/ehiggs/spark-terasort

This is an example Spark program for running TeraSort benchmarks. It is based on work from Reynold Xin's branch (https://github.com/rxin/spark/tree/terasort), but it is not the same TeraSort program that currently holds the record (http://sortbenchmark.org/). That program is here: https://github.com/rxin/spark/tree/sort-benchmark/core/src/main/scala/org/apache/spark/sort

I've been working on other projects at the moment, so I haven't returned to the spark-terasort work. If you have any pull requests, I would be very grateful.

Yours,
Ewan

On 08/04/15 03:26, Pramod Biligiri wrote:

+1. I would love to have the code for this as well.

Pramod

On Fri, Apr 3, 2015 at 12:47 PM, Tom thubregt...@gmail.com wrote:

Hi all,

As we all know, Spark has set the record for sorting data, as published at https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html. Here at our group, we would love to verify these results and compare machines using this benchmark. We've spent quite some time trying to find the TeraSort source code that was used, but cannot find it anywhere. We did find two candidates:

A version posted by Reynold [1], the poster of the message above. This version is stuck at "// TODO: Add partition-local (external) sorting using TeraSortRecordOrdering", only generating data. Here, Ewan noticed that it didn't appear to be similar to Hadoop TeraSort [2]. After this he created a version of his own [3]. With this version, we noticed problems with TeraValidate on datasets above ~10G (as mentioned by others in [4]). When examining the raw input and output files, it actually appears that the input data is sorted and the output data unsorted in both cases. Because of this, we believe we have not yet found the source code that was actually used.

I've searched the Spark user forum archives and found requests from other people, indicating a demand, but did not succeed in finding the actual source code. My question: could you please make the source code of the TeraSort program that was used, preferably with settings, available? If not, what are the reasons that it seems to be withheld?

Thanks for any help,

Tom Hubregtsen

[1] https://github.com/rxin/spark/commit/adcae69145905162fa3b6932f70be2c932f95f87
[2] http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3c5462092c.1060...@ugent.be%3E
[3] https://github.com/ehiggs/spark-terasort
[4] http://mail-archives.apache.org/mod_mbox/spark-user/201501.mbox/%3CCAPszQwgap4o1inZkTwcwV=7scwoqtr5yxfnsqo5p2kgp1bn...@mail.gmail.com%3E
RE: How to use Joda Time with Spark SQL?
Actually, I did a little investigation of Joda time when I was working on SPARK-4987 for Timestamp ser/de in Parquet format. Joda offers an interface to construct its objects from native Java objects. For example, to convert a java.util.Date (parent of java.sql.Date and java.sql.Timestamp) object named jd, in Java code you can use:

    DateTime dt = new DateTime(jd);

or in Scala code:

    val dt: DateTime = new DateTime(jd)

In the other direction, given a DateTime object named dt, you can use code like:

    val jd: java.sql.Timestamp = new Timestamp(dt.getMillis)

to get the Java object.

Thanks,
Daoyuan
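Putting the two directions together, a small hedged sketch of the workaround this thread converges on: keep Joda in the domain model and convert to java.sql.Timestamp at the Spark SQL boundary. The Event/EventRow names are illustrative only:

    import java.sql.Timestamp
    import org.joda.time.DateTime

    // Hypothetical domain class using Joda time.
    case class Event(id: Long, ts: DateTime)

    // Mirror class with Spark SQL-friendly types, suitable for a DataFrame.
    case class EventRow(id: Long, ts: Timestamp)

    def toRow(e: Event): EventRow = EventRow(e.id, new Timestamp(e.ts.getMillis))
    def fromRow(r: EventRow): Event = Event(r.id, new DateTime(r.ts))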
Re: How to use Joda Time with Spark SQL?
These common UDTs can always be wrapped in libraries and published to spark-packages (http://spark-packages.org/) :-)

Cheng
Re: function to convert to pair
I have to create some kind of index from my JavaRDD<Object>; it should be something like JavaPairRDD<uniqueIndex, Object>, but zipWithIndex gives (Object, Long). Later I need to use this RDD for a join, so it looks like it won't work for me.

On 9 April 2015 at 04:17, Ted Yu yuzhih...@gmail.com wrote:

Please take a look at zipWithIndex() of RDD.

Cheers

On Wed, Apr 8, 2015 at 3:40 PM, Jeetendra Gangele gangele...@gmail.com wrote:

Hi All,

I have an RDD<SomeObject> and I want to convert it to RDD<(sequenceNumber, SomeObject)>, where the sequence number is 1 for the first SomeObject, 2 for the second SomeObject, and so on.

Regards,
jeet
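A hedged sketch of the usual pattern, in Scala for brevity (the Java API is analogous: zipWithIndex() followed by mapToPair to swap the pair); names and sample data are illustrative, not from this thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(
      new SparkConf().setAppName("zip-demo").setMaster("local[*]"))

    val data = sc.parallelize(Seq("a", "b", "c"))

    // zipWithIndex pairs each element with its 0-based index, preserving
    // the RDD's element order: (element, index). Swapping the pair (and
    // adding 1 for the 1-based numbering asked for above) yields
    // (index, element), which can then serve as a join key.
    val indexed = data.zipWithIndex().map { case (v, i) => (i + 1, v) }
    indexed.collect().foreach(println) // (1,a) (2,b) (3,c)

    sc.stop()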
Re: Regarding zipWithIndex
bq. will return something like JavaPairRDD<Object, Long>

The Long component of the pair fits your description of an index. What other requirement does zipWithIndex not provide you?

Cheers
Regarding zipWithIndex
Hi All,

I have a JavaRDD<Object> and I want to convert it to JavaPairRDD<Index, Object>. The index should be unique and should maintain the order: the first object should get 1, the second 2, and so on. I tried using zipWithIndex, but it returns something like JavaPairRDD<Object, Long>. I want to use this RDD for lookup and join operations later in my workflow, so ordering is important.

Regards,
jeet