Re: spark ML Recommender program

2017-05-18 Thread Nick Pentreath
Could you try setting the checkpoint interval for ALS (try, say, 3 or 5) and see what the effect is? On Thu, 18 May 2017 at 07:32 Mark Vervuurt wrote: > If you are running locally, try increasing driver memory to, for example, 4G and executor memory to 3G. > Regards, Mark >
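
For reference, a minimal sketch of lowering the ALS checkpoint interval in the spark.ml API (the checkpoint directory, column names, and the 'ratings' DataFrame are assumptions, not from the thread):

    import org.apache.spark.ml.recommendation.ALS

    // A checkpoint directory must be set before fitting, or the interval is ignored.
    spark.sparkContext.setCheckpointDir("/tmp/als-checkpoints")  // hypothetical path

    val als = new ALS()
      .setUserCol("userId")          // hypothetical column names
      .setItemCol("movieId")
      .setRatingCol("rating")
      .setCheckpointInterval(3)      // checkpoint every 3 iterations (default is 10)

    val model = als.fit(ratings)     // 'ratings' is an assumed input DataFrame

Checkpointing more often truncates the lineage ALS builds up across iterations, which can reduce memory pressure at the cost of extra disk writes.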

Re: How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-18 Thread Jörn Franke
You can also write it into a file and view it using your favorite viewer/editor. > On 18. May 2017, at 04:55, kant kodali wrote: > > Hi All, > > How to see the full contents of a dataset or dataframe in structured streaming, just like we normally do with df.show(false)? Is

Re: How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-18 Thread kant kodali
So for the console sink it is not possible? On Wed, May 17, 2017 at 11:30 PM, Jörn Franke wrote: > You can also write it into a file and view it using your favorite > viewer/editor > > On 18. May 2017, at 04:55, kant kodali wrote: > > Hi All, > > How to

Re: spark ML Recommender program

2017-05-18 Thread Nick Pentreath
It sounds like this may be the same as https://issues.apache.org/jira/browse/SPARK-20402 On Thu, 18 May 2017 at 08:16 Nick Pentreath wrote: > Could you try setting the checkpoint interval for ALS (try, say, 3 or 5) and > see what the effect is? > > > > On Thu, 18 May 2017

Spark Structured Streaming is taking too long to process 2KB messages

2017-05-18 Thread kant kodali
Hi All, Here is my code. Dataset df = ds.select(functions.from_json(new Column("value").cast("string"), getSchema()).as("payload")); Dataset df1 = df.selectExpr("payload.data.*"); StreamingQuery query = df1.writeStream().outputMode("append").option("truncate",
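
For reference, a hedged Scala reconstruction of the truncated pipeline above (the original is Java and cut off mid-option; 'ds', 'getSchema()', and the console sink settings are carried over or assumed):

    import org.apache.spark.sql.functions.{col, from_json}

    // Parse the Kafka-style 'value' bytes as JSON into a 'payload' struct column.
    val df  = ds.select(from_json(col("value").cast("string"), getSchema()).as("payload"))
    val df1 = df.selectExpr("payload.data.*")

    val query = df1.writeStream
      .outputMode("append")
      .format("console")               // assumed sink; the message is truncated here
      .option("truncate", "false")
      .start()
    query.awaitTermination()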

Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran
On 18 May 2017, at 05:29, lucas.g...@gmail.com wrote: Steve, just to clarify: "FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way better on high-performance reads, especially if you are working with column data and can set the
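
The quoted advice is cut off; as an assumption about where it was heading, one Hadoop 2.8 S3A setting that helps columnar reads (ORC/Parquet) is the experimental fadvise policy:

    // Sketch: switch the S3A input stream to random-access reads,
    // which suits columnar formats. Requires the Hadoop 2.8 S3A client.
    spark.sparkContext.hadoopConfiguration
      .set("fs.s3a.experimental.input.fadvise", "random")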

Optimizing dataset joins

2017-05-18 Thread Daniel Haviv
Hi, With RDDs it was possible to define a partitioner for two RDDs, and given that two RDDs have the same partitioner, a join operation would be performed local to the partition without shuffling. Can dataset joins be optimized in the same way? Is it enough to repartition two datasets on the
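
For reference, one way Datasets can get shuffle-free joins is by bucketing both sides on the join key; this sketch assumes Spark 2.x and hypothetical table and column names:

    // Write both Datasets bucketed and sorted on the join key.
    // Matching bucket counts let the planner avoid a shuffle for the join.
    dsA.write.bucketBy(64, "id").sortBy("id").saveAsTable("bucketed_a")
    dsB.write.bucketBy(64, "id").sortBy("id").saveAsTable("bucketed_b")

    val joined = spark.table("bucketed_a").join(spark.table("bucketed_b"), "id")

Note that merely calling repartition on both Datasets does not by itself guarantee the optimizer skips the exchange; bucketed tables make the partitioning visible to the planner.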

Re: checkpointing without streaming?

2017-05-18 Thread Neelesh Sambhajiche
That is exactly what we are currently doing - storing it in a CSV file. However, as checkpointing permanently writes to disk, if we use checkpointing along with saving the RDD to a text file, the data gets stored twice on the disk. That is why I was looking for a way to read the checkpointed data

Re: spark cluster performance decreases by adding more nodes

2017-05-18 Thread Junaid Nasir
I can see tasks are divided equally between the nodes; how can I check if one node is getting all the traffic? Also, I get similar results when querying just df.count(). Thank you for your time :) On Wed, May 17, 2017 at 8:32 PM, Jörn Franke wrote: > The issue might be group by ,

Re: How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-18 Thread kant kodali
Looks like there is .option("truncate", "false") On Wed, May 17, 2017 at 11:30 PM, Jörn Franke wrote: > You can also write it into a file and view it using your favorite > viewer/editor > > On 18. May 2017, at 04:55, kant kodali wrote: > > Hi All, > >
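
For reference, a minimal sketch of the console sink with truncation disabled, assuming a streaming Dataset named 'df':

    val query = df.writeStream
      .format("console")
      .outputMode("append")
      .option("truncate", "false")   // show full column contents
      .option("numRows", 50)         // assumed: raise the default of 20 rows
      .start()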

Re: spark cluster performance decreases by adding more nodes

2017-05-18 Thread jnasir
One Cassandra node. Best Regards, Junaid Nasir > > On May 18, 2017 at 3:56 AM, guha.a...@gmail.com wrote: > > > > How many nodes do you have in the Cassandra cluster? > > > > > On Thu, 18 May 2017 at 1:33 am, Jörn Franke

Re: Not able pass 3rd party jars to mesos executors

2017-05-18 Thread Michael Gummelt
No, --jars doesn't work in cluster mode on Mesos. We need to document that better. Do you have some problem that can't be solved by bundling your dependency into your application (i.e. an uberjar)? On Tue, May 16, 2017 at 10:00 PM, Satya Narayan1 < satyanarayan.pa...@gmail.com> wrote: > Hi, Is
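
For reference, a sketch of the uberjar approach with sbt-assembly (the plugin version and the third-party dependency are assumptions):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

    // build.sbt -- mark Spark itself as provided so only your own
    // dependencies end up in the assembled jar
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core"      % "2.1.1" % "provided",
      "com.example"      %  "third-party-lib" % "1.0"       // hypothetical
    )

Running 'sbt assembly' then produces a single jar you can hand to spark-submit, sidestepping --jars entirely.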

SparkAppHandle - get Input and output streams

2017-05-18 Thread Nipun Arora
Hi, I wanted to know how to get the input and output streams from SparkAppHandle. I start the application like the following: SparkAppHandle sparkAppHandle = sparkLauncher.startApplication(); I have used the following previously to capture the input stream from the error and output streams, but I would

Re: SparkAppHandle - get Input and output streams

2017-05-18 Thread Marcelo Vanzin
On Thu, May 18, 2017 at 10:10 AM, Nipun Arora wrote: > I wanted to know how to get the input and output streams from > SparkAppHandle? You can't. You can redirect the output, but not directly get the streams. -- Marcelo
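
For reference, a sketch of the redirection Marcelo mentions, using the Spark 2.x launcher API (the app resource, main class, and file paths are hypothetical):

    import java.io.File
    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")             // hypothetical
      .setMainClass("com.example.Main")               // hypothetical
      .redirectOutput(new File("/tmp/spark-app.out"))
      .redirectError(new File("/tmp/spark-app.err"))
      .startApplication()

The child process's stdout/stderr land in those files; the handle itself only exposes application state, not streams.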

Re: How to see the full contents of a dataset or dataframe in structured streaming?

2017-05-18 Thread Michael Armbrust
You can write it to the memory sink. df.writeStream.format("memory").queryName("myStream").start() spark.table("myStream").show() On Wed, May 17, 2017 at 7:55 PM, kant kodali wrote: > Hi All, > > How to see the full contents of a dataset or dataframe in structured >

Re: checkpointing without streaming?

2017-05-18 Thread Tathagata Das
You can use *SparkContext.checkpointFile()*. However, note that the checkpoint file contains Java-serialized data. So if your data types change between writing and reading the checkpoint file for whatever reason (Spark version change, your code was recompiled, etc.), you may not be able to
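
For reference, a sketch of the round trip under those caveats (the paths and element type are assumptions; note also that checkpointFile is package-private in some Spark versions, in which case the reading code must live under an org.apache.spark package):

    sc.setCheckpointDir("hdfs:///tmp/checkpoints")      // hypothetical path
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
    rdd.checkpoint()
    rdd.count()    // an action forces the checkpoint to be written

    // Later, possibly from another job -- the rdd-specific subdirectory
    // under the checkpoint dir must be located first:
    val restored = sc.checkpointFile[(String, Int)]("hdfs:///tmp/checkpoints/...")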

Re: Spark Structured Streaming is taking too long to process 2KB messages

2017-05-18 Thread kant kodali
OK, so the problem really was that I was compiling with 2.1.0 jars and supplying 2.1.1 at run time. Once I changed to 2.1.1 at compile time as well, it seems to work fine and I can see all my 75 fields. On Thu, May 18, 2017 at 2:39 AM, kant kodali wrote: > Hi All, > > Here is my

Forcing either Hive or Spark SQL representation for metastore

2017-05-18 Thread Justin Miller
Hello, I was wondering if there were a way to force one representation or another for the Hive metastore. Some of our data can’t be parsed with the Hive method so it switches over to the Spark SQL method, leaving some of our data stored in Spark SQL format and some in Hive format. It’d be nice
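
As a loudly hedged guess at a related knob (it controls the read path for metastore Parquet tables, and may or may not address the representation issue described above):

    // Assumption: forcing Spark to go through the Hive serde for
    // metastore Parquet tables instead of its native reader.
    spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")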

IoT in Spark

2017-05-18 Thread Gaurav1809
Hello gurus, How exactly does it work in real-world scenarios when it comes to reading data from IoT devices (say, for example, sensors at the in/out gates of a huge mall)? Can we do it in Spark? Do we need to use any other tool/utility (Kafka???) to read data from those sensors and then process them in Spark?

Re: IoT in Spark

2017-05-18 Thread Bharath Mundlapudi
Hi Gaurav, The answer is yes; you can do it with Spark. Note that first you need to understand what Spark is used for. For the problem statement you mentioned, you need many more technology components - Kafka, Spark Streaming, and Spark - in addition to IoT-related software at the edge and gateway.
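
For reference, a minimal sketch of the Kafka-to-Spark leg of such a pipeline, assuming Spark 2.1 structured streaming with the spark-sql-kafka-0-10 package on the classpath (broker address and topic name are hypothetical):

    val readings = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical
      .option("subscribe", "gate-sensors")                // hypothetical topic
      .load()
      .selectExpr("CAST(value AS STRING) AS reading")

    val query = readings.writeStream
      .format("console")
      .outputMode("append")
      .start()

Sensors publish to Kafka at the edge/gateway, and Spark consumes from there; Spark does not talk to the devices directly.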

How does the Spark hiveserver dynamically update a function-dependent jar?

2017-05-18 Thread 李斌松
If you create a temporary function that references a jar file on HDFS and then update that jar file, the change does not take effect immediately; you need to restart the hiveserver.
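
For reference, a sketch of the registration being described (the function name, class, and HDFS path are hypothetical). The jar is resolved when the function is registered, which is why replacing it on HDFS afterwards is not picked up until a restart:

    spark.sql("""
      CREATE TEMPORARY FUNCTION my_udf AS 'com.example.MyUDF'
      USING JAR 'hdfs:///udfs/my-udf.jar'
    """)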

Re: IoT in Spark

2017-05-18 Thread Kuchekar
Hi Gaurav, You might want to look into the Lambda Architecture with Spark. https://www.youtube.com/watch?v=xHa7pA94DbA Regards, Kuchekar, Nilesh On Thu, May 18, 2017 at 8:58 PM, Gaurav1809 wrote: > Hello gurus, > > How exactly does it work in real-world