Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
Is there a Kafka sink for Spark Structured Streaming? Sent from my iPhone

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kant kodali
Thanks! On Fri, May 19, 2017 at 4:50 PM, Tathagata Das wrote: > Should release by the end of this month. > > On Fri, May 19, 2017 at 4:07 PM, kant kodali wrote: > >> Hi Patrick, >> >> I am using 2.1.1 and I tried the above code you sent and I

GraphFrames 0.5.0 - critical bug fix + other improvements

2017-05-19 Thread Joseph Bradley
Hi Spark community, I'd like to announce a new release of GraphFrames, a Spark Package for DataFrame-based graphs! *We strongly encourage all users to use this latest release for the bug fix described below.* *Critical bug fix* This release fixes a bug in indexing vertices. This may have
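
Not from the announcement itself, but a sketch of picking up the release; the package coordinate below follows the usual spark-packages naming convention and should be verified against the GraphFrames site before use:

    # Launch with the package (coordinate assumed, verify before use):
    #   pyspark --packages graphframes:graphframes:0.5.0-spark2.1-s_2.11
    from graphframes import GraphFrame

    # vertices_df needs an "id" column; edges_df needs "src" and "dst".
    # Both DataFrames here are hypothetical placeholders.
    g = GraphFrame(vertices_df, edges_df)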

Spark Streaming: Custom Receiver OOM consistently

2017-05-19 Thread Manish Malhotra
Hello, I have implemented a Java-based custom receiver, which consumes from a messaging system, say JMS. Once a message is received, I call store(object) ... I'm storing a Spark Row object. It runs for around 8 hrs and then goes OOM, and the OOM is happening on the receiver nodes. I also tried to run multiple
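
Not a fix from the thread, just a sketch of common mitigations for receiver-side OOM: throttle ingest and let stored blocks spill to disk. The rate value below is a hypothetical placeholder.

    # PySpark side: enable backpressure and cap the receiver ingest rate.
    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.streaming.backpressure.enabled", "true")
            .set("spark.streaming.receiver.maxRate", "1000"))  # records/sec, hypothetical

    # On the Java side, constructing the receiver with
    # StorageLevel.MEMORY_AND_DISK_SER_2 lets stored blocks spill to disk
    # instead of accumulating on the receiver's heap.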

SparkSQL not able to read a empty table location

2017-05-19 Thread Bajpai, Amit X. -ND
Hi, I have a Hive external table with the S3 location having no files (but the S3 location directory does exist). When I try to use Spark SQL to count the number of records in the table, it throws an error saying “File s3n://data/xyz does not exist. null/0”. select * from tablex limit
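
Not from the thread, but one defensive sketch is to check the location for objects before querying and fall back to an empty result. Bucket, prefix, and table names are hypothetical; boto3 is assumed for the listing and an active SparkSession named spark is assumed in scope.

    import boto3

    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="data", Prefix="xyz/", MaxKeys=1)
    if resp.get("KeyCount", 0) > 0:
        count = spark.sql("SELECT COUNT(*) FROM tablex").collect()[0][0]
    else:
        count = 0  # empty location: skip the scan entirely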

Re: [Spark Streaming] Streaming job failing consistently after 1h

2017-05-19 Thread Manish Malhotra
I'm also facing the same problem. I have implemented a Java-based custom receiver, which consumes from a messaging system, say JMS. Once a message is received, I call store(object) ... I'm storing a Spark Row object. It runs for around 8 hrs and then goes OOM, and the OOM is happening on the receiver nodes. I also tried

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Patrick McGloin
# Write key-value data from a DataFrame to a Kafka topic specified in an option
query = df \
  .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("topic",
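
For reference, a completed sketch of the same pipeline (assuming Spark 2.2+, where the Kafka data source supports streamed writing; the topic name and checkpoint location are hypothetical placeholders):

    query = df \
        .selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value") \
        .writeStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
        .option("topic", "topic1") \
        .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") \
        .start()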

java.lang.OutOfMemoryError

2017-05-19 Thread Kürşat Kurt
Hi; I am trying multiclass text classification with a RandomForest classifier on my local computer (16 GB RAM, 4 physical cores). When I run with the parameters below, I am getting a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. spark-submit --driver-memory 1G --driver-memory
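
Not from the thread: with tree ensembles this error usually means the heap is too small for the training aggregates, so a first step is raising driver/executor memory and bounding tree size. All values below are hypothetical.

    # spark-submit --driver-memory 4G --executor-memory 8G ...  (hypothetical sizes)
    from pyspark.ml.classification import RandomForestClassifier

    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=20, maxDepth=5,   # smaller trees and
                                maxBins=32)                # fewer split bins use less memory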

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kanth909
Hi! Is this possible in Spark 2.1.1? Sent from my iPhone > On May 19, 2017, at 5:55 AM, Patrick McGloin > wrote: > > # Write key-value data from a DataFrame to a Kafka topic specified in an > option > query = df \ > .selectExpr("CAST(userId AS STRING)

Reading PDF/text/word file efficiently with Spark

2017-05-19 Thread tesmai4
Hi, I am doing NLP (Natural Language Processing) processing on my data. The data is in the form of files that can be of type PDF/Text/Word/HTML. These files are stored in a directory structure on my local disk, even nested directories. My standalone Java-based NLP parser can read input files, extract
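
Not from the thread, but a sketch of the usual approach: read the files as (path, bytes) pairs and map the existing parser over them. parse_document is a hypothetical wrapper around the standalone parser, and the path is a placeholder.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("nlp-files").getOrCreate()
    sc = spark.sparkContext

    def parse_document(path, raw_bytes):
        # hypothetical hook into the standalone NLP parser,
        # e.g. via a subprocess call or Py4J
        return (path, len(raw_bytes))

    # binaryFiles yields (path, bytes) pairs; nested directories can be
    # matched one level per glob, e.g. "file:///data/docs/*" or ".../*/*".
    files = sc.binaryFiles("file:///data/docs/*")
    parsed = files.map(lambda kv: parse_document(kv[0], kv[1]))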

Re: Refreshing a persisted RDD

2017-05-19 Thread Sudhir Menon
Part of the problem here is that a static DataFrame is designed to be used as a read-only abstraction in Spark, and updating it requires the user to drop the DataFrame holding the reference data and recreate it. And in order for the join to use the recreated DataFrame, the query has to be
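
Not from the thread, but a minimal sketch of the drop-and-recreate pattern described above (the Parquet path is a hypothetical placeholder):

    # Initial load of the reference data.
    ref = spark.read.parquet("s3://bucket/reference/").cache()

    # Later, to refresh: drop the cached copy and rebuild the DataFrame.
    # Queries must then be re-wired to join against the new `ref`.
    ref.unpersist()
    ref = spark.read.parquet("s3://bucket/reference/").cache()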

Bizarre UI Behavior after migration

2017-05-19 Thread Miles Crawford
Hey y'all, Trying to migrate from Spark 1.6.1 to 2.1.0. I use EMR, and launched a new cluster using EMR 5.5, which runs Spark 2.1.0. I updated my dependencies, fixed a few API changes related to accumulators, and presto! My application was running on the new cluster. But the application UI

Re: Spark UI shows Jobs are processing, but the files are already written to S3

2017-05-19 Thread Miles Crawford
Could I be experiencing the same thing? https://www.dropbox.com/s/egtj1056qeudswj/sparkwut.png?dl=0 On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal wrote: > I think that is a bug. I have seen that a lot especially with long running > jobs where Spark skips a lot of

Shuffle read is very slow in spark

2017-05-19 Thread KhajaAsmath Mohammed
Hi, I am in a weird situation where the Spark job behaves abnormally. Sometimes it runs very fast and sometimes it takes a long time to complete. Not sure what is going on. The cluster is free most of the time. The image below shows the shuffle read is still taking more than 3 hours to write data back
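
Not from the thread: when shuffle read dominates like this, the shuffle partition count is usually the first knob to check, since too few (or skewed) partitions serialize the read. The value below is a hypothetical starting point.

    # Raise the number of shuffle partitions before the wide operation.
    spark.conf.set("spark.sql.shuffle.partitions", "400")  # default is 200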

Re: Spark-SQL collect function

2017-05-19 Thread Aakash Basu
Well described, thanks! On 04-May-2017 4:07 AM, "JayeshLalwani" wrote: > In any distributed application, you scale up by splitting execution up on > multiple machines. The way Spark does this is by slicing the data into > partitions and spreading them on multiple
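
A tiny illustration of the partitioning idea quoted above (a sketch, not from the thread):

    # One logical dataset, sliced into 8 partitions that execute in parallel.
    df = spark.range(1000000).repartition(8)
    print(df.rdd.getNumPartitions())  # 8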

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread Aakash Basu
Hey all, A reply on this would be great! Thanks, A.B. On 17-May-2017 1:43 AM, "Daniel Siegmann" wrote: > When using spark.read on a large number of small files, these are > automatically coalesced into fewer partitions. The only documentation I can > find on

RE: IOT in Spark

2017-05-19 Thread Lohith Samaga M
Hi Gaurav, You can process IoT data using Spark. But where will you store the raw/processed data - Cassandra, Hive, HBase? You might want to look at a Hadoop cluster for data storage and processing (Spark on YARN). For processing streaming data, you might also

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread kant kodali
Hi Patrick, I am using 2.1.1 and I tried the above code you sent and I get "java.lang.UnsupportedOperationException: Data source kafka does not support streamed writing", so this probably works only from Spark 2.2 onwards. I am not sure when that will officially release. Thanks! On Fri, May 19,

Re: Documentation on "Automatic file coalescing for native data sources"?

2017-05-19 Thread ayan guha
I think, like all other read operations, it is driven by the input format used, and I believe some variation of CombineFileInputFormat is used by default. You can test it by forcing a particular input format that gets one file per split; then you should end up with the same number of partitions as
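
Not from the thread, but a sketch of the test suggested above (the path is a hypothetical directory of many small files): compare the partition count of the coalescing DataFrame reader against an RDD reader that takes at least one split per file.

    df = spark.read.text("/data/many-small-files")                # coalescing reader
    rdd = spark.sparkContext.textFile("/data/many-small-files")   # >= 1 split per file
    print(df.rdd.getNumPartitions(), rdd.getNumPartitions())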

Re: Is there a Kafka sink for Spark Structured Streaming

2017-05-19 Thread Tathagata Das
Should release by the end of this month. On Fri, May 19, 2017 at 4:07 PM, kant kodali wrote: > Hi Patrick, > > I am using 2.1.1 and I tried the above code you sent and I get > > "java.lang.UnsupportedOperationException: Data source kafka does not > support streamed writing"