spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Does the Spark Streaming consumer for Kinesis use the Kinesis Client Library, and does it mandate checkpointing the shard sequence numbers in DynamoDB? Will it lead to data loss if consumed data records are not yet processed, Kinesis has checkpointed the consumed sequence numbers in DynamoDB, and Spark

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Mich Talebzadeh
This is an interesting point. As far as I know, in any database (practically all RDBMSs: Oracle, SAP, etc.), the LIMIT affects the collection part of the result set. The result set is computed fully by the query, which may involve multiple joins on multiple underlying tables. To limit the actual

Using a Custom Data Store with Spark 2.0

2016-10-24 Thread Sachith Withana
Hi all, I have a requirement to integrate a custom data store to be used with Spark (v2.0.1). It consists of structured data in tables along with the schemas. I then want to run Spark SQL queries on the data and provide the data back to the data service. I'm wondering what would be the best way

[Spark 2.0.1] Error in generated code, possible regression?

2016-10-24 Thread Efe Selcuk
I have an application that works in 2.0.0 but has been dying at runtime on the 2.0.1 distribution. at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:893) at

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
Hi, The only thing you can do for Kinesis checkpoints is tune their interval. https://github.com/apache/spark/blob/master/external/kinesis-asl/src/main/scala/org/apache/spark/streaming/kinesis/KinesisUtils.scala#L68 Whether data loss occurs or not depends on the storage level you set;
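For reference, a minimal sketch of wiring up the Kinesis receiver with an explicit checkpoint interval (stream name, endpoint and region below are placeholders; the KCL still writes its shard sequence-number checkpoints to a DynamoDB table named after the application):

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(sc, Seconds(10))   // sc: an existing SparkContext

// The Duration argument controls how often the KCL checkpoints shard sequence numbers to DynamoDB;
// MEMORY_AND_DISK_2 replicates received blocks so a failed receiver is less likely to drop data.
val stream = KinesisUtils.createStream(
  ssc,
  "my-kinesis-app",                            // KCL application name (placeholder)
  "my-stream",                                 // Kinesis stream name (placeholder)
  "https://kinesis.us-east-1.amazonaws.com",   // endpoint (placeholder)
  "us-east-1",
  InitialPositionInStream.LATEST,
  Seconds(60),                                 // checkpoint interval
  StorageLevel.MEMORY_AND_DISK_2)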

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 20:32, Cheng Lian wrote: On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian wrote: What version of Spark are you using and

Re: LIMIT issue of SparkSQL

2016-10-24 Thread Michael Armbrust
It is not about limits on specific tables. We do support that. The case I'm describing involves pushing limits across system boundaries. It is certainly possible to do this, but the current datasource API does not provide this information (other than the implicit limit that is pushed down to the

Modifying Metadata in StructType schemas

2016-10-24 Thread Everett Anderson
Hi, I've been using the immutable Metadata within the StructType of a DataFrame/Dataset to track application-level column lineage. However, since it's immutable, the only way to modify it is to do a full trip of 1. Convert DataFrame/Dataset to Row RDD 2. Create new, modified Metadata per
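A lighter-weight alternative than the full RDD round trip, if it fits the use case, is to re-attach new Metadata to a column with Column.as(alias, metadata); a small sketch (the column name and metadata key are made up):

import org.apache.spark.sql.types.MetadataBuilder

// Start from the column's existing metadata and add an application-level entry.
val oldMeta = df.schema("amount").metadata
val newMeta = new MetadataBuilder()
  .withMetadata(oldMeta)
  .putString("lineage", "source_system_a")   // hypothetical lineage key
  .build()

// Re-alias the column with the new metadata; everything else in the schema is untouched.
val withLineage = df.withColumn("amount", df("amount").as("amount", newMeta))
withLineage.schema("amount").metadata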

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
Yes, thanks for elaborating, Michael. The other thing that I wanted to highlight was that in this specific case the value is actually exactly zero (0E-18 = 0*10^(-18) = 0). On Mon, Oct 24, 2016 at 8:50 PM, Michael Matsko wrote: > Efe, > > I think Jakob's point is that

RE: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Mendelson, Assaf
Hi, I believe that the UDF is only a small part of the problem. You can easily test this by using a UDF for the dataframe too. In my testing I saw that using datasets can be considerably slower than dataframes. I can make a guess as to why this happens. Basically, what you are doing in a dataframe is

[Spark 2] BigDecimal and 0

2016-10-24 Thread Efe Selcuk
I’m trying to track down what seems to be a very slight imprecision in our Spark application; two of our columns, which should be netting out to exactly zero, are coming up with very small fractions of non-zero value. The only thing that I’ve found out of place is that a case class entry into a

Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
Thanks, Cheng Lian. I tried using a case class: scala> case class Tags (category: String, weight: String) scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight } testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
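One workaround, assuming the usual behaviour that Spark passes array-of-struct elements into a UDF as Rows rather than as the case class, is to declare the UDF over Seq[Row] and read fields with getAs; a sketch:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Elements of an array<struct<category,weight>> column arrive in the UDF as Row objects.
val firstWeight = udf { tags: Seq[Row] =>
  if (tags.nonEmpty) tags.head.getAs[String]("weight") else null
}

mblog_tags.withColumn("first_weight", firstWeight(col("tags"))).show()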

Re: spark streaming with kinesis

2016-10-24 Thread Shushant Arora
Thanks! Are Kinesis streams receiver-based only? Is there a non-receiver-based consumer for Kinesis? And instead of having a fixed checkpoint interval, can I disable auto checkpointing and, once my worker has processed the data after the last record of mapPartitions, checkpoint the sequence no

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Efe Selcuk
I should have noted that I understand the notation of 0E-18 (exponential form, I think) and that in a normal case it is no different than 0; I just wanted to make sure that there wasn't something tricky going on since the representation was seemingly changing. Michael, that's a fair point. I keep

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
What you're seeing is merely a strange representation: 0E-18 is zero. The E-18 represents the precision that Spark uses to store the decimal. On Mon, Oct 24, 2016 at 7:32 PM, Jakob Odersky wrote: > An even smaller example that demonstrates the same behaviour: > >
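A quick REPL check that the two spellings are the same value, differing only in scale:

val zero = BigDecimal("0E-18")   // how a Decimal(38, 18) zero prints
zero == BigDecimal(0)            // true: scala.math.BigDecimal equality is compare-based
zero.bigDecimal.compareTo(java.math.BigDecimal.ZERO) == 0   // true: same value, different scale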

Re: Get size of intermediate results

2016-10-24 Thread Takeshi Yamamuro
-dev +user Hi, Have you tried this? scala> val df = Seq((1, 0), (2, 0), (3, 0), (4, 0)).toDF.cache scala> df.queryExecution.executedPlan(0).execute().foreach(x => Unit) scala> df.rdd.toDebugString res4: String = (4) MapPartitionsRDD[13] at rdd at <console>:26 [] | MapPartitionsRDD[12] at rdd at <console>:26

Re: spark streaming with kinesis

2016-10-24 Thread Takeshi Yamamuro
I'm not exactly sure which receiver you're referring to, but if you mean the "KinesisReceiver" implementation, yes. Also, we currently cannot disable the interval checkpoints. On Tue, Oct 25, 2016 at 11:53 AM, Shushant Arora wrote: > Thanks! > > Is kinesis streams

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Michael Matsko
Efe, I think Jakob's point is that there is no problem. When you deal with real numbers, you don't get exact representations of numbers. There is always some slop in representations; things don't ever cancel out exactly. Testing reals for equality to zero will almost never work.

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Efe Selcuk
Okay, so this isn't contributing to any kind of imprecision. I suppose I need to go digging further then. Thanks for the quick help. On Mon, Oct 24, 2016 at 7:34 PM Jakob Odersky wrote: > What you're seeing is merely a strange representation, 0E-18 is zero. > The E-18

Re: [Spark 2] BigDecimal and 0

2016-10-24 Thread Jakob Odersky
An even smaller example that demonstrates the same behaviour: Seq(Data(BigDecimal(0))).toDS.head On Mon, Oct 24, 2016 at 7:03 PM, Efe Selcuk wrote: > I’m trying to track down what seems to be a very slight imprecision in our > Spark application; two of our columns, which

Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
scala> mblog_tags.dtypes res13: Array[(String, String)] = Array((tags,ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true))) scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight } testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
Sounds like you are running in standalone mode. Have you checked the UI on port 4040 (default) to see where the memory is going? Why do you need executor memory of 10GB? How many executors are running, and how many slaves have been started? In standalone mode executors run on workers (UI 8080). HTH Dr

pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread pietrop
Hi there, I opened a question on StackOverflow at this link: http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-dateformat-pattern-in-spark-read-load-for-dates?noredirect=1#comment67297930_40007972 I didn’t get any useful answer, so I’m writing here hoping that someone can

Re: JAVA heap space issue

2016-10-24 Thread Sankar Mittapally
Hi Mich, Yes, I am using a standalone mode cluster. We have two executors with 10G memory each, and two workers. FYI.. On Mon, Oct 24, 2016 at 5:22 PM, Mich Talebzadeh wrote: > Sounds like you are running in standalone mode. > > Have you checked the UI on port

Shortest path with directed and weighted graphs

2016-10-24 Thread Brian Wilson
I have been looking at the ShortestPaths function built into Spark here. Am I correct in saying there is no support for weighted graphs with this function? By that I mean that

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I actually think this is a general problem with usage of DateFormat and SimpleDateFormat across the code, in that it relies on the default locale of the JVM. I believe this needs to, at least, default consistently to Locale.US so that behavior is consistent; otherwise it's possible that parsing
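The locale dependence is easy to reproduce outside Spark; a tiny sketch of the underlying SimpleDateFormat behaviour:

import java.text.SimpleDateFormat
import java.util.Locale

// "MMM" resolves month names through the formatter's locale: with, say, an Italian default
// locale "Dec" fails to parse, while an explicit Locale.US always works.
val explicitUs = new SimpleDateFormat("ddMMMyyyy", Locale.US)
explicitUs.parse("31Dec1989")      // parses regardless of the JVM default locale

val jvmDefault = new SimpleDateFormat("ddMMMyyyy")
jvmDefault.parse("31Dec1989")      // throws ParseException under a non-English default locale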

JAVA heap space issue

2016-10-24 Thread sankarmittapally
Hi, I have a three node cluster with 30G of memory. I am trying to analyze 200MB of data and running out of memory every time. This is the command I am using: Driver Memory = 10G, Executor memory = 10G sc <- sparkR.session(master =

why is that two stages in apache spark are computing same thing?

2016-10-24 Thread maitraythaker
I have a Spark optimization question that I have posted on StackOverflow; any guidance would be appreciated. Please follow the link below, where I have explained the problem in depth with code.

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within the coming few days. 2016-10-24 21:32 GMT+09:00 Sean Owen: > I actually think this is a general problem with usage of DateFormat and > SimpleDateFormat across the code, in that it relies on the default

Re: JAVA heap space issue

2016-10-24 Thread Sankar Mittapally
sc <- sparkR.session(master = "spark://ip-172-31-6-116:7077",sparkConfig=list(spark.executor.memory="10g", spark.app.name="Testing",spark.driver.memory="14g",spark.executor.extraJavaOption="-Xms2g -Xmx5g -XX:-UseGCOverheadLimit",spark.driver.extraJavaOption="-Xms2g -Xmx5g

Re: JAVA heap space issue

2016-10-24 Thread Sankar Mittapally
I have a lot of join SQL operations, which is blocking me from writing data, and I unpersist the data if it's not useful. On Oct 24, 2016 7:50 PM, "Mich Talebzadeh" wrote: > OK so you are disabling broadcasting although it is not obvious how this > helps in this case! > > Dr Mich

Re: Spark Sql 2.0 throws null pointer exception

2016-10-24 Thread Selvam Raman
Why am I not able to access the SparkSession instance within foreachPartition (I have created the SparkSession instance within the main function)? spark.sql("select 1").count, or any SQL query run within foreachPartition, throws a NullPointerException. Please give me some idea if you have faced the
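For context, the SparkSession lives only on the driver, so a reference captured inside foreachPartition is null on the executors; a hedged sketch of the usual restructuring (table, column and dataset names here are hypothetical) is to run the SQL on the driver and ship plain values to the workers:

import org.apache.spark.sql.Row

// Run the query on the driver and materialise a small lookup table...
val lookup: Map[String, String] = spark.sql("SELECT id, label FROM dim_table")
  .collect()
  .map(r => r.getString(0) -> r.getString(1))
  .toMap

// ...then use the plain Map inside foreachPartition instead of calling spark.sql there.
records.foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    val label = lookup.getOrElse(row.getAs[String]("id"), "unknown")
    // process (row, label) here
  }
}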

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
Rather strange, as you have plenty of free memory there. Try reducing driver memory to 2GB and executor memory to 2GB and run it again: ${SPARK_HOME}/bin/spark-submit \ --driver-memory 2G \ --num-executors 2 \ --executor-cores 1 \

Re: JAVA heap space issue

2016-10-24 Thread Sankar Mittapally
Hi Mich, I am able to write the files to storage after adding an extra parameter. FYI, this is the one I used: spark.sql.autoBroadcastJoinThreshold="-1" On Mon, Oct 24, 2016 at 7:22 PM, Mich Talebzadeh wrote: > Rather strange, as you have plenty of free memory there. > >

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
OK so you are disabling broadcasting although it is not obvious how this helps in this case! Dr Mich Talebzadeh

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
This is more of an OS-level thing, but I think that if you can manage to set -Duser.language=en to the JVM, it might do the trick. I summarized what I think I know about this at https://issues.apache.org/jira/browse/SPARK-18076 and so we can decide what to do, if anything, there. Sean On Mon,

Re: JAVA heap space issue

2016-10-24 Thread Mich Talebzadeh
OK, so what is your full launch code now? I mean the equivalent of spark-submit. Dr Mich Talebzadeh

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
Thank you, I’d appreciate that. I have no experience with Python, Java and Spark, so the question can be translated to: “How can I set the JVM locale when using spark-submit and pyspark?”. Probably this is possible only by changing the system default locale and not within the Spark session,

Re: Shortest path with directed and weighted graphs

2016-10-24 Thread Michael Malak
Chapter 6 of my book implements Dijkstra's Algorithm. The source code is available to download for free.  https://www.manning.com/books/spark-graphx-in-action From: Brian Wilson To: user@spark.apache.org Sent: Monday, October 24, 2016 7:11 AM Subject:
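For a weighted variant with no extra dependencies, the Pregel-based single-source shortest-paths example from the GraphX programming guide is a reasonable starting point; a sketch assuming a Graph[_, Double] whose edge attribute is the weight and a placeholder source vertex:

import org.apache.spark.graphx._

val sourceId: VertexId = 1L   // placeholder source vertex
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),          // vertex program: keep the best distance seen
  triplet => {                                             // send messages along edges that improve the distance
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                                 // merge messages: take the minimum
)
sssp.vertices.collect()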

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
If you're really sure that 4 executors are on 1 machine, then it means your resource manager allowed it. What are you using, YARN? Check that you really are limited to 40 cores per machine in the YARN config. On Mon, Oct 24, 2016 at 3:33 PM TheGeorge1918 . wrote: > Hi

Re: reading info from spark 2.0 application UI

2016-10-24 Thread Sean Owen
What matters in this case is how many vcores YARN thinks it can allocate per machine. I think the relevant setting is yarn.nodemanager.resource.cpu-vcores. I bet you'll find this is actually more than the machine's number of cores, possibly on purpose, to enable some over-committing. On Mon, Oct

Accessing Phoenix table from Spark 2.0., any cure!

2016-10-24 Thread Mich Talebzadeh
My stack is this: Spark 2.0.0, ZooKeeper 3.4.6, HBase 1.2.3, Phoenix apache-phoenix-4.8.1-HBase-1.2-bin. I am running this simple code: scala> val df = sqlContext.load("org.apache.phoenix.spark", | Map("table" -> "MARKETDATAHBASE", "zkUrl" -> "rhes564:2181") | )
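For what it's worth, the sqlContext.load(...) form is deprecated in Spark 2.x in favour of the DataFrameReader API, so if the phoenix-spark connector on the classpath supports Spark 2.0 at all (Phoenix 4.8.1 may predate that support), this is the form to try; a sketch, not verified against this exact stack:

val df = spark.read
  .format("org.apache.phoenix.spark")
  .option("table", "MARKETDATAHBASE")
  .option("zkUrl", "rhes564:2181")
  .load()
df.printSchema()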

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
This worked without setting other options: spark/bin/spark-submit --conf "spark.driver.extraJavaOptions=-Duser.language=en" test.py Thank you again! Pietro > On 24 Oct 2016, at 17:18, Sean Owen wrote: > > I believe it will be too late to set it there,

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Sean Owen
I believe it will be too late to set it there, and these are JVM flags, not app or Spark flags. See spark.driver.extraJavaOptions and likewise for the executor. On Mon, Oct 24, 2016 at 4:04 PM Pietro Pugni wrote: > Thank you! > > I tried again setting locale options in

Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Mich Talebzadeh
Me being lazy: does anyone have a library to create an array of random numbers from a normal distribution with a given mean and variance, by any chance? Something like here

Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Pietro Pugni
Thank you! I tried again, setting locale options in different ways, but they don't propagate to the JVM. I tested these strategies (alone and all together): - bin/spark-submit --conf "spark.executor.extraJavaOptions=-Duser.language=en -Duser.region=US -Duser.country=US -Duser.timezone=GMT" test.py -

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Jörn Franke
Bigtop contains a random data generator mainly for transactions, but it could be rather easily adapted > On 24 Oct 2016, at 18:04, Mich Talebzadeh wrote: > > me being lazy > > Does anyone have a library to create an array of random numbers from normal >

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Mich Talebzadeh
thanks Jorn. I wish we had these libraries somewhere :) Dr Mich Talebzadeh

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Sean Owen
In the context of Spark, there are already things like RandomRDD and SQL randn() to generate random standard normal variables. If you want to do it directly, Commons Math is a good choice in the JVM, among others. Once you have a standard normal, just multiply by the stdev and add the mean to
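A concrete version of that suggestion with the SQL functions (mean and standard deviation below are arbitrary; the variance is the square of the standard deviation):

import org.apache.spark.sql.functions.randn

val mean = 10.0
val stdDev = 2.0   // variance = 4.0

// randn() draws from N(0, 1); scale by the standard deviation and shift by the mean.
val samples = spark.range(1000000)
  .select((randn(42L) * stdDev + mean).as("x"))

samples.describe("x").show()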

Re: Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Daniel Darabos
Hi Antoaneta, I believe the difference is not due to Datasets being slower (DataFrames are just an alias to Datasets now), but rather to using a user-defined function for filtering vs using Spark builtins. The builtin can use tricks from Project Tungsten, such as only deserializing the "event_type"
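A small illustration of the distinction (the Event case class and values are made up):

import org.apache.spark.sql.functions.col
import spark.implicits._

case class Event(event_type: String, payload: String)
val eventsDs = Seq(Event("view", "a"), Event("click", "b")).toDS()
val events = eventsDs.toDF()

// Built-in Column expression: Catalyst sees the predicate, so Tungsten can evaluate it on the
// serialized representation and read only the event_type column.
val viaBuiltin = events.filter(col("event_type") === "view").count()

// Typed filter (or a UDF): the predicate is opaque, so every row is deserialized into an
// Event object before the lambda runs.
val viaLambda = eventsDs.filter(_.event_type == "view").count()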

Re: Generate random numbers from Normal Distribution with Specific Mean and Variance

2016-10-24 Thread Jörn Franke
https://github.com/rnowling/bigpetstore-data-generator > On 24 Oct 2016, at 19:17, Mich Talebzadeh wrote: > > thanks Jorn. > > I wish we had these libraries somewhere :) > > Dr Mich Talebzadeh > > LinkedIn >

Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-24 Thread Masood Krohy
Hi everyone, Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark 1.6.0)? I do not see an option for this. If not, is there a way to get this IP address after the Spark app has
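On the second half of the question (reading the address back once the application is running), the driver's location is recorded in Spark's own configuration; a minimal sketch in Scala (the PySpark equivalent goes through sc.getConf().get):

// Once the app is up, Spark records where the driver ended up running.
val driverHost = sc.getConf.get("spark.driver.host")
val driverPort = sc.getConf.get("spark.driver.port")
println(s"driver at $driverHost:$driverPort")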

Spark Sql 2.0 throws null pointer exception

2016-10-24 Thread Selvam Raman
Hi All, Please help me. I have 10 parquet files (tables' data) in S3. I am reading and storing them as Datasets and then registering them as temp tables. One table drives the whole flow, so I am doing the below. (When I am triggering the query from the code base: SparkSession spark =

Need help with SVM

2016-10-24 Thread aditya1702
Hello, I am using a linear SVM to train my model and generate a line through my data. However, my model always predicts 1 for all the feature examples. Here is my code: print data_rdd.take(5) [LabeledPoint(1.0, [1.9643,4.5957]), LabeledPoint(1.0, [2.2753,3.8589]), LabeledPoint(1.0, [2.9781,4.5651]),
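One thing worth checking when every prediction comes back as 1 is the model's decision threshold versus its raw margins; a hedged MLlib sketch (Scala API; `training` stands in for the RDD[LabeledPoint] shown above):

import org.apache.spark.mllib.classification.SVMWithSGD

// Train, then clear the default 0.0 threshold so predict() returns raw margins instead of 0/1;
// that shows whether the scores are genuinely all on one side or just being thresholded badly.
val model = SVMWithSGD.train(training, 100)
model.clearThreshold()
training.map(p => (p.label, model.predict(p.features))).take(5).foreach(println)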

Re: Issues with reading gz files with Spark Streaming

2016-10-24 Thread Steve Loughran
On 22 Oct 2016, at 20:58, Nkechi Achara wrote: I do not use rename, and the files are written to, and then moved to, a directory on HDFS in gz format. In that case there's nothing obvious to me. Try logging at trace/debug the class:

Spark 2.0 - DataFrames vs Dataset performance

2016-10-24 Thread Antoaneta Marinova
Hello, I am using Spark 2.0 for performing filtering, grouping and counting operations on events data saved in parquet files. As the events schema has very nested structure I wanted to read them as scala beans to simplify the code but I noticed a severe performance degradation. Below you can find

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-24 Thread Cheng Lian
On 10/22/16 1:42 PM, Efe Selcuk wrote: Ah, looks similar. Next opportunity I get, I'm going to do a printSchema on the two datasets and see if they don't match up. I assume that unioning the underlying RDDs doesn't run into this problem because of less type checking or something along those

Re: Spark SQL is slower when DataFrame is cache in Memory

2016-10-24 Thread Kazuaki Ishizaki
Hi Chin Wei, I am sorry for being late to reply. Got it. Interesting behavior. How did you measure the time between 1st and 2nd events? Best Regards, Kazuaki Ishizaki From: Chin Wei Low To: Kazuaki Ishizaki/Japan/IBM@IBMJP Cc: user@spark.apache.org Date:

Re: Writing to Parquet Job turns to wait mode after even completion of job

2016-10-24 Thread Cheng Lian
On 10/22/16 6:18 AM, Steve Loughran wrote: ... On Sat, Oct 22, 2016 at 3:41 AM, Cheng Lian wrote: What version of Spark are you using and how many output files does the job write out? By default, Spark versions before 1.6

Re: [Spark 2.0.0] error when unioning to an empty dataset

2016-10-24 Thread Efe Selcuk
All right, I looked at the schemas. There is one mismatching nullability, on a scala.Boolean. It looks like an empty Dataset with that *cannot* be nullable. However, when I run my code to generate the Dataset, the schema comes back with nullable = true. Effectively: scala> val empty =
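If the nullability mismatch does turn out to be the cause, one workaround sketch (names hypothetical) is to build the empty side from the populated Dataset's own schema so both union inputs agree:

import org.apache.spark.sql.Row
import spark.implicits._

// Reuse the populated Dataset's schema, including its nullable flags, for the empty side.
val emptyMatching = spark
  .createDataFrame(spark.sparkContext.emptyRDD[Row], populated.schema)
  .as[MyRecord]   // MyRecord stands in for the thread's case class

val combined = populated.union(emptyMatching)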

Re: Getting the IP address of Spark Driver in yarn-cluster mode

2016-10-24 Thread Steve Loughran
On 24 Oct 2016, at 19:34, Masood Krohy wrote: Hi everyone, Is there a way to set the IP address/hostname that the Spark Driver is going to be running on when launching a program through spark-submit in yarn-cluster mode (PySpark