Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
Well, the difference is, a technical user writes the UDF and a non-technical user may use this built-in thing (misconfigure it) and shoot themselves in the foot. On Wed, Jul 1, 2020, 6:40 PM Andrew Melo wrote: > On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz wrote: > > > > I'

Re: REST Structured Streaming Sink

2020-07-01 Thread Burak Yavuz
I'm not sure having a built-in sink that allows you to DDOS servers is the best idea either. foreachWriter is typically used for such use cases, not foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, etc. Best, Burak On Wed, Jul 1, 2020 at 5:54 PM Holden Karau wrote:

Re: [spark streaming] checkpoint location feature for batch processing

2020-05-01 Thread Burak Yavuz
Hi Rishi, That is exactly why Trigger.Once was created for Structured Streaming. The way we look at streaming is that it doesn't have to be always real time, or 24-7 always on. We see streaming as a workflow that you have to repeat indefinitely. See this blog post for more details! https://databri
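
A minimal sketch of the Trigger.Once pattern being described (broker, topic, and paths are placeholders, not from the thread; assumes an existing SparkSession `spark`):

    import org.apache.spark.sql.streaming.Trigger

    val query = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")    // placeholder broker
      .option("subscribe", "events")                       // placeholder topic
      .load()
      .writeStream
      .format("parquet")
      .option("path", "/data/events")                      // placeholder output path
      .option("checkpointLocation", "/checkpoints/events") // lets the next run resume where this one stopped
      .trigger(Trigger.Once())                             // process what's available, then stop
      .start()

    query.awaitTermination()  // schedule this as a periodic batch job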

Re: Spark structured streaming - Fallback to earliest offset

2020-04-14 Thread Burak Yavuz
Just set `failOnDataLoss=false` as an option in readStream? On Tue, Apr 14, 2020 at 4:33 PM Ruijing Li wrote: > Hi all, > > I have a spark structured streaming app that is consuming from a kafka > topic with retention set up. Sometimes I face an issue where my query has > not finished processing
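
For reference, a sketch of where that option goes (broker and topic are placeholders):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "mytopic")
      .option("failOnDataLoss", "false")  // fall back to available offsets instead of failing the query
      .load()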

Re: ForEachBatch collecting batch to driver

2020-03-11 Thread Burak Yavuz
foreachBatch gives you the micro-batch as a DataFrame, which is distributed. If you don't call collect on that DataFrame, it shouldn't have any memory implications on the Driver. On Tue, Mar 10, 2020 at 3:46 PM Ruijing Li wrote: > Hi all, > > I’m curious on how foreachbatch works in spark struct

Re: Best way to read batch from Kafka and Offsets

2020-02-15 Thread Burak Yavuz
n is: Queries with streaming sources must be executed with >>> writeStream.start() >>> >>> But I thought forEachBatch would treat the batchDF as a static dataframe? >>> >>> Thanks, >>> RJ >>> >>> On Wed, Feb 5, 2020 at 12:48 AM Gourav Sen

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
a ForEachBatch write mode but > my thinking was at that point it was easier to read from kafka through > batch mode. > > Thanks, > RJ > > On Tue, Feb 4, 2020 at 4:20 PM Burak Yavuz wrote: > >> Hi Ruijing, >> >> Why do you not want to use structured

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
Hi Ruijing, Why do you not want to use structured streaming here? This is exactly why structured streaming + Trigger.Once was built, just so that you don't build that solution yourself. You also get exactly once semantics if you use the built in sinks. Best, Burak On Mon, Feb 3, 2020 at 3:15 PM

Re: Structured Streaming & Enrichment Broadcasts

2019-11-18 Thread Burak Yavuz
If you store the data that you're going to broadcast as a Delta table (see delta.io) and perform a stream-batch (where your Delta table is the batch) join, it will auto-update once the table receives any updates. Best, Burak On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey wrote: > Hello. > > We're
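
A rough sketch of the stream-batch join being described, assuming a Delta table at a placeholder path with an "id" column, and a Kafka source whose key carries that id:

    val lookup = spark.read.format("delta").load("/delta/lookup")  // batch side; reflects table updates

    val enriched = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")
      .join(lookup, Seq("id"))  // stream-batch join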

Re: Delta with intelligent upsert

2019-11-02 Thread Burak Yavuz
You can just add the target partitioning filter to your MERGE or UPDATE condition, e.g. MERGE INTO target USING source ON target.key = source.key AND target.year = year(current_date()) ... Best, Burak On Thu, Oct 31, 2019, 10:15 PM ayan guha wrote: > > Hi > > we have a scenario where we have a
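
Spelled out a little more (table and column names are hypothetical; the WHEN clauses follow Delta's MERGE syntax):

    spark.sql("""
      MERGE INTO target t
      USING source s
      ON t.key = s.key AND t.year = year(current_date())  -- prunes the scan to the current partition
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *
    """)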

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
lication is using offsets that are no longer available in >>> Kafka it will reset to earliest or latest offset available in Kafka and the >>> next request made to Kafka should provide proper data. But in case for all >>> micro-batches the offsets are getting reseted and

Re: Spark Kafka Streaming making progress but there is no data to be consumed

2019-09-11 Thread Burak Yavuz
Do you have rate limiting set on your stream? It may be that you are trying to process records that have passed the retention period within Kafka. On Wed, Sep 11, 2019 at 2:39 PM Charles vinodh wrote: > > Hi, > > I am trying to run a spark application ingesting data from Kafka using the > Spark

Re: Static partitioning in partitionBy()

2019-05-07 Thread Burak Yavuz
It depends on the data source. Delta Lake (https://delta.io) allows you to do it with the .option("replaceWhere", "c = c1"). With other file formats, you can write directly into the partition directory (tablePath/c=c1), but you lose atomicity. On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia wrote:
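
A sketch of the replaceWhere overwrite (path and predicate are placeholders):

    df.write
      .format("delta")
      .mode("overwrite")
      .option("replaceWhere", "c = 'c1'")  // atomically replaces only rows matching the predicate
      .save("/delta/table")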

Re: Spark Structured Streaming using S3 as data source

2018-08-26 Thread Burak Yavuz
Yes, the checkpoint makes sure that you start off from where you left off. On Sun, Aug 26, 2018 at 2:22 AM sherif98 wrote: > I have data that is continuously pushed to multiple S3 buckets. I want to > set > up a structured streaming application that uses the S3 buckets as the data > source and d

Re: Kafka backlog - spark structured streaming

2018-07-30 Thread Burak Yavuz
If you don't set rate limiting through `maxOffsetsPerTrigger`, Structured Streaming will always process until the end of the stream. So number of records waiting to be processed should be 0 at the start of each trigger. On Mon, Jul 30, 2018 at 8:03 AM, Kailash Kalahasti < kailash.kalaha...@gmail.c
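
For context, rate limiting is set as a source option, roughly like this (broker, topic, and the limit are placeholders):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", "10000")  // cap the number of offsets consumed per micro-batch
      .load()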

Re: Structured Streaming: distinct (Spark 2.2)

2018-03-19 Thread Burak Yavuz
I believe the docs are out of date regarding distinct. The behavior should be as follows: - Distinct should be applied across triggers - In order to prevent the state from growing indefinitely, you need to add a watermark - If you don't have a watermark, but your key space is small, that's also

Re: Infer JSON schema in structured streaming Kafka.

2017-12-11 Thread Burak Yavuz
In Spark 2.2, you can read from Kafka in batch mode, and then use the json reader to infer schema: val df = spark.read.format("kafka")... .select($"value".cast("string")).as[String] val json = spark.read.json(df) val schema = json.schema While the above should be slow (since you're reading almost all data

Re: Reload some static data during struct streaming

2017-11-13 Thread Burak Yavuz
I think if you don't cache the jdbc table, then it should auto-refresh. On Mon, Nov 13, 2017 at 1:21 PM, spark receiver wrote: > Hi > > I’m using struct streaming(spark 2.2) to receive Kafka msg ,it works > great. The thing is I need to join the Kafka message with a relative static > table stor

Re: Getting Message From Structured Streaming Format Kafka

2017-11-02 Thread Burak Yavuz
Hi Daniel, Several things: 1) Your error seems to suggest you're using a different version of Spark and a different version of the sql-kafka connector. Could you make sure they are on the same Spark version? 2) With Structured Streaming, you may remove everything related to a StreamingContext.

Re: Spark Structured Streaming not connecting to Kafka using kerberos

2017-10-16 Thread Burak Yavuz
Hi Darshan, How are you creating your kafka stream? Can you please share the options you provide? spark.readStream.format("kafka") .option(...) // all these please .load() On Sat, Oct 14, 2017 at 1:55 AM, Darshan Pandya wrote: > Hello, > > I'm using Spark 2.1.0 on CDH 5.8 with kafka 0.10.

Re: Structured streaming coding question

2017-09-20 Thread Burak Yavuz
> KafkaSink("hello2")).start(); > > query1.awaitTermination(); > query2.awaitTermination(); > sparkSession.streams().awaitAnyTermination(); > > > > > > On Tue, Sep 19, 2017 at 11:48 PM, Burak Yavuz wrote: > >> Hey Kant, >> >> That won't work

Re: Structured streaming coding question

2017-09-19 Thread Burak Yavuz
Hey Kant, That won't work either. Your second query may fail, and as long as your first query is running, you will not know. Put this as the last line instead: spark.streams.awaitAnyTermination() On Tue, Sep 19, 2017 at 10:11 PM, kant kodali wrote: > Looks like my problem was the order of awai
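
A minimal sketch of the suggested pattern (queries and sinks are placeholders):

    val query1 = df1.writeStream.format("console").start()
    val query2 = df2.writeStream.format("console").start()

    // Blocks until any active query stops, and rethrows the exception if one of them failed.
    spark.streams.awaitAnyTermination()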

Re: [Structured Streaming] Trying to use Spark structured streaming

2017-09-11 Thread Burak Yavuz
Hi Eduardo, What you have written out is to output counts "as fast as possible" for windows of 5-minute length with a sliding window of 1 minute. So for a record at 10:13, you would get that record included in the count for 10:09-10:14, 10:10-10:15, 10:11-10:16, 10:12-10:17, 10:13-10:18. Plea
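
The windowing being described looks roughly like this (column and DataFrame names are assumptions):

    import spark.implicits._
    import org.apache.spark.sql.functions.window

    val counts = events
      .groupBy(window($"eventTime", "5 minutes", "1 minute"))  // 5-minute windows sliding every minute
      .count()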

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Burak Yavuz
x("time")* > > On Wed, Aug 30, 2017 at 10:38 AM, kant kodali wrote: > >> @Burak so how would the transformation or query would look like for the >> above example? I don't see flatMapGroupsWithState in the DataSet API >> Spark 2.1.1. I may be able to upgrade t

Re: How to select the entire row that has max timestamp for every key in Spark Structured Streaming 2.1.1?

2017-08-29 Thread Burak Yavuz
Hey TD, If I understood the question correctly, your solution wouldn't return the exact result, since it also groups by destination. I would say the easiest solution would be to use flatMapGroupsWithState, where you: .groupByKey(_.train) and keep in state the row with the maximum time. On T

Re: [SS] Why is a streaming aggregation required for complete output mode?

2017-08-18 Thread Burak Yavuz
Hi Jacek, The way the memory sink is architected at the moment is that it either appends a row (append/update mode) or replaces all rows (complete mode). When a user specifies a checkpoint location, the guarantee Structured Streaming provides is that output sinks will not lose data and will be abl

Re: [SPARK STRUCTURED STREAMING]: Alternatives to using Foreach sink in pyspark

2017-07-28 Thread Burak Yavuz
Hi Priyank, You may register them as temporary tables to use across language boundaries. Python: df = spark.readStream... # Python logic df.createOrReplaceTempView("tmp1") Scala: val df = spark.table("tmp1") df.writeStream .foreach(...) On Fri, Jul 28, 2017 at 3:06 PM, Priyank Shrivastava w

Re: What are some disadvantages of issuing a raw sql query to spark?

2017-07-25 Thread Burak Yavuz
I think Kant meant time windowing functions. You can use `window(TIMESTAMP, '24 hours', '24 hours')` On Tue, Jul 25, 2017 at 9:26 AM, Keith Chapman wrote: > Here is an example of a window lead function, > > select *, lead(someColumn1) over ( partition by someColumn2 order by > someColumn13 asc

Re: to_json not working with selectExpr

2017-07-16 Thread Burak Yavuz
Hi Matthew, Which Spark version are you using? The expression `to_json` was added in 2.2 with this commit: https://github.com/apache/spark/commit/0cdcf9114527a2c359c25e46fd6556b3855bfb28 Best, Burak On Sun, Jul 16, 2017 at 6:24 PM, Matthew cao wrote: > Hi all, > I just read the databricks blog

Re: Querying on Deeply Nested JSON Structures

2017-07-16 Thread Burak Yavuz
Have you checked out this blog post? https://databricks.com/blog/2017/02/23/working-complex-data-formats-structured-streaming-apache-spark-2-1.html Shows tools and tips on how to work with nested data. You can access data through `field1.field2.field3` and such with JSON. Best, Burak On Sat, Jul

Re: How save streaming aggregations on 'Structured Streams' in parquet format ?

2017-06-19 Thread Burak Yavuz
Hi Kaniska, In order to use append mode with aggregations, you need to set an event time watermark (using `withWatermark`). Otherwise, Spark doesn't know when to output an aggregation result as "final". Best, Burak On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal wrote: > Hi, > > My goal is to
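
A sketch of the watermarked aggregation with append mode (column names, paths, and thresholds are placeholders):

    import spark.implicits._
    import org.apache.spark.sql.functions.window

    val agg = events
      .withWatermark("eventTime", "10 minutes")  // tells Spark when a window can be considered final
      .groupBy(window($"eventTime", "5 minutes"))
      .count()

    agg.writeStream
      .outputMode("append")  // only finalized windows are emitted
      .format("parquet")
      .option("path", "/data/agg")
      .option("checkpointLocation", "/chk/agg")
      .start()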

Re: Spark SQL within a DStream map function

2017-06-16 Thread Burak Yavuz
Do you really need to create a DStream from the original messaging queue? Can't you just read them in a while loop or something on the driver? On Fri, Jun 16, 2017 at 1:01 PM, Mike Hugo wrote: > Hello, > > I have a web application that publishes JSON messages on to a messaging > queue that conta

Re: Structured Streaming from Parquet

2017-05-25 Thread Burak Yavuz
Hi Paul, From what you're describing, it seems that stream1 is possibly generating tons of small files and stream2 is OOMing because it tries to maintain an in-memory list of files. Some notes/questions: 1. Parquet files are splittable, therefore having large parquet files shouldn't be a proble

Re: couple naive questions on Spark Structured Streaming

2017-05-22 Thread Burak Yavuz
Hi Kant, > > > 1. Can we use Spark Structured Streaming for stateless transformations > just like we would do with DStreams or Spark Structured Streaming is only > meant for stateful computations? > Of course you can do stateless transformations. Any map, filter, select, type of transformation is

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
exactly the same schema, but > one side support null and the other doesn't, this exception (in union > dataset) will be thrown? > > > > 2017-05-08 16:41 GMT-03:00 Burak Yavuz : > >> I also want to add that generally these may be caused by the >> `nullability` fie

Re: Why does dataset.union fails but dataset.rdd.union execute correctly?

2017-05-08 Thread Burak Yavuz
I also want to add that generally these may be caused by the `nullability` field in the schema. On Mon, May 8, 2017 at 12:25 PM, Shixiong(Ryan) Zhu wrote: > This is because RDD.union doesn't check the schema, so you won't see the > problem unless you run RDD and hit the incompatible column probl

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett, IIRC we added unionAll in Spark 2.0 which is the same implementation as rdd union. The union in DataFrames with Spark 2.0 does deduplication, and that's why you should be seeing the slowdown. Best, Burak On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson wrote: > Looks like the Dat

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
> 2017-02-06 14:25 GMT-08:00 Burak Yavuz : > >> Hi Egor, >> >> Structured Streaming handles all of its metadata itself, which files are >> actually valid, etc. You may use the "create table" syntax in SQL to treat >> it like a hive table, but it will han

Re: [Structured Streaming] Using File Sink to store to hive table.

2017-02-06 Thread Burak Yavuz
Hi Egor, Structured Streaming handles all of its metadata itself, which files are actually valid, etc. You may use the "create table" syntax in SQL to treat it like a hive table, but it will handle all partitioning information in its own metadata log. Is there a specific reason that you want to st

Re: eager? in dataframe's checkpoint

2017-01-31 Thread Burak Yavuz
rdd.isCheckpointed > will be false, and the count will be on the rdd before it was checkpointed. > what is the benefit of that? > > > On Thu, Jan 26, 2017 at 11:19 PM, Burak Yavuz wrote: > >> Hi, >> >> One of the goals of checkpointing is to cut the RDD lineage. Oth

Re: eager? in dataframe's checkpoint

2017-01-26 Thread Burak Yavuz
Hi, One of the goals of checkpointing is to cut the RDD lineage. Otherwise you run into StackOverflowExceptions. If you eagerly checkpoint, you basically cut the lineage there, and the next operations all depend on the checkpointed DataFrame. If you don't checkpoint, you continue to build the line
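
In code form (the checkpoint directory is a placeholder; Dataset.checkpoint takes an eager flag in Spark 2.1+):

    spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  // placeholder directory
    val checkpointed = df.checkpoint(eager = true)  // materializes df now and truncates its lineage
    // Later operations build on the checkpointed data instead of the full lineage.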

Re: Java heap error during matrix multiplication

2017-01-26 Thread Burak Yavuz
Hi, Have you tried creating more column blocks? BlockMatrix matrix = cmatrix.toBlockMatrix(100, 100); for example. Is your data randomly spread out, or do you generally have clusters of data points together? On Wed, Jan 25, 2017 at 4:23 AM, Petr Shestov wrote: > Hi all! > > I'm using Spark

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
shyla deshpande wrote: > Thanks Burak. But with BloomFilter, won't I be getting a false poisitve? > > On Wed, Jan 25, 2017 at 11:28 AM, Burak Yavuz wrote: > >> I noticed that 1 wouldn't be a problem, because you'll save the >> BloomFilter in the state. >&g

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
ve me 2 solutions > 1. Bloom filter --> problem in repopulating the bloom filter on restarts > 2. keeping the state of the unique ids > > Please elaborate on 2. > > > > On Wed, Jan 25, 2017 at 10:53 AM, Burak Yavuz wrote: > >> I don't have any sample code,

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
> On Wed, Jan 25, 2017 at 9:13 AM, Burak Yavuz wrote: > >> Off the top of my head... (Each may have it's own issues) >> >> If upstream you add a uniqueId to all your records, then you may use a >> BloomFilter to approximate if you've seen a row before. &

Re: How to make the state in a streaming application idempotent?

2017-01-25 Thread Burak Yavuz
Off the top of my head... (Each may have it's own issues) If upstream you add a uniqueId to all your records, then you may use a BloomFilter to approximate if you've seen a row before. The problem I can see with that approach is how to repopulate the bloom filter on restarts. If you are certain t

Re: Spark Streaming - join streaming and static data

2016-12-06 Thread Burak Yavuz
Hi Daniela, This is trivial with Structured Streaming. If your Kafka cluster is 0.10.0 or above, you may use Spark 2.0.2 to create a Streaming DataFrame from Kafka, and then also create a DataFrame using the JDBC connection, and you may join those. In Spark 2.1, there's support for a function call
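
A rough sketch of the stream-static join described above (connection details and column names are placeholders):

    val jdbcDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://host/db")  // placeholder connection
      .option("dbtable", "lookup")
      .load()

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload")

    val joined = kafkaDf.join(jdbcDf, Seq("id"))  // stream-static join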

Re: How to cause a stage to fail (using spark-shell)?

2016-06-18 Thread Burak Yavuz
Hi Jacek, Can't you simply have a mapPartitions task throw an exception or something? Are you trying to do something more esoteric? Best, Burak On Sat, Jun 18, 2016 at 5:35 AM, Jacek Laskowski wrote: > Hi, > > Following up on this question, is a stage considered failed only when > there is a F

Re: Any NLP lib could be used on spark?

2016-04-19 Thread Burak Yavuz
A quick search on spark-packages returns: http://spark-packages.org/package/databricks/spark-corenlp. You may need to build it locally and add it to your session by --jars. On Tue, Apr 19, 2016 at 10:47 AM, Gavin Yue wrote: > Hey, > > Want to try the NLP on the spark. Could anyone recommend any

Re: RethinkDB as a Datasource

2016-03-04 Thread Burak Yavuz
Hi, You can always write it as a data source and share it on Spark Packages. There are many data source connectors available already: http://spark-packages.org/?q=tags%3A%22Data%20Sources%22 Best, Burak On Fri, Mar 4, 2016 at 3:19 PM, pnakibar wrote: > Hi, > I see that there is no way to use R

Re: Calculation of histogram bins and frequency in Apache spark 1.6

2016-02-23 Thread Burak Yavuz
You could use the Bucketizer transformer in Spark ML. Best, Burak On Tue, Feb 23, 2016 at 9:13 AM, Arunkumar Pillai wrote: > Hi > Is there any predefined method to calculate histogram bins and frequency > in spark. Currently I take range and find bins then count frequency using > SQL query. > >
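
A small sketch of Bucketizer used for histogram bins (column names and split points are assumptions):

    import org.apache.spark.ml.feature.Bucketizer

    val bucketizer = new Bucketizer()
      .setInputCol("value")
      .setOutputCol("bucket")
      .setSplits(Array(0.0, 10.0, 20.0, 30.0, Double.PositiveInfinity))  // bin edges

    val histogram = bucketizer.transform(df)
      .groupBy("bucket")
      .count()  // frequency per bin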

Re: Using SPARK packages in Spark Cluster

2016-02-12 Thread Burak Yavuz
Hello Gourav, The packages need to be loaded BEFORE you start the JVM, therefore you won't be able to add packages dynamically in code. You should use the --packages with pyspark before you start your application. One option is to add a `conf` that will load some packages if you are constantly goi

Re: Guidelines for writing SPARK packages

2016-02-01 Thread Burak Yavuz
Thanks for the reply David, just wanted to fix one part of your response: > If you > want to register a release for your package you will also need to push > the artifacts for your package to Maven central. > It is NOT necessary to push to Maven Central in order to make a release. There are many

Re: Redirect Spark Logs to Kafka

2016-02-01 Thread Burak Yavuz
You can use the KafkaLog4jAppender ( https://github.com/apache/kafka/blob/trunk/log4j-appender/src/main/java/org/apache/kafka/log4jappender/KafkaLog4jAppender.java ). Best, Burak On Mon, Feb 1, 2016 at 12:20 PM, Ashish Soni wrote: > Hi All , > > Please let me know how we can redirect spark logg

Re: Optimized way to multiply two large matrices and save output using Spark and Scala

2016-01-13 Thread Burak Yavuz
BlockMatrix.multiply is the suggested method of multiplying two large matrices. Is there a reason that you didn't use BlockMatrices? You can load the matrices and convert to and from RowMatrix. If it's in sparse format (i, j, v), then you can also use the CoordinateMatrix to load, BlockMatrix to m

Re: number of blocks in ALS/recommendation API

2015-12-17 Thread Burak Yavuz
Copying the first part from the scaladoc: " This is a blocked implementation of the ALS factorization algorithm that groups the two sets of factors (referred to as "users" and "products") into blocks and reduces communication by only sending one copy of each user vector to each product block on eac

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
6 branch as of commit >> db5165246f2888537dd0f3d4c5a515875c7358ed. That makes it much less >> serious of an issue, although it would be nice to know what the root cause >> is to avoid a regression. >> >> On Thu, Dec 10, 2015 at 4:03 PM Burak Yavuz wrote: >> >>

Re: Spark streaming with Kinesis broken?

2015-12-10 Thread Burak Yavuz
I've noticed this happening when there was some dependency conflicts, and it is super hard to debug. It seems that the KinesisClientLibrary version in Spark 1.5.2 is 1.3.0, but it is 1.2.1 in Spark 1.5.1. I feel like that seems to be the problem... Brian, did you verify that it works with the 1.6.

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
Not sure if it would be the most efficient, but maybe you can think of the filesystem as a key value store, and write each batch to a sub-directory, where the directory name is the batch time. If the directory already exists, then you shouldn't write it. Then you may have a following batch job that

Re: large, dense matrix multiplication

2015-11-13 Thread Burak Yavuz
Hi, The BlockMatrix multiplication should be much more efficient on the current master (and will be available with Spark 1.6). Could you please give that a try if you have the chance? Thanks, Burak On Fri, Nov 13, 2015 at 10:11 AM, Sabarish Sasidharan < sabarish.sasidha...@manthan.com> wrote: >

Re: Spark Packages Configuration Not Found

2015-11-11 Thread Burak Yavuz
Hi Jakob, > As another, general question, are spark packages the go-to way of extending spark functionality? Definitely. There are ~150 Spark Packages out there in spark-packages.org. I use a lot of them in everyday Spark work. The number of released packages has steadily increased over the

Re: spark-submit --packages using different resolver

2015-10-03 Thread Burak Yavuz
Hi Jerry, The --packages feature doesn't support private repositories right now. However, in the case of s3, maybe it might work. Could you please try using the --repositories flag and provide the address: `$ spark-submit --packages my:awesome:package --repositories s3n://$aws_ak:$aws_sak@bucket/p

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-09-09 Thread Burak Yavuz
On Tue, Aug 25, 2015 at 1:57 PM, Burak Yavuz wrote: > >> Hmm. I have a lot of code on the local linear algebra operations using >> Spark's Matrix and Vector representations >> done for https://issues.apache.org/jira/browse/SPARK-6442. >> >> I can make a

Re: Calculating Min and Max Values using Spark Transformations?

2015-08-28 Thread Burak Yavuz
Or you can just call describe() on the dataframe? In addition to min-max, you'll also get the mean, and count of non-null and non-NA elements as well. Burak On Fri, Aug 28, 2015 at 10:09 AM, java8964 wrote: > Or RDD.max() and RDD.min() won't work for you? > > Yong > > --

Re: Adding/subtracting org.apache.spark.mllib.linalg.Vector in Scala?

2015-08-25 Thread Burak Yavuz
Hmm. I have a lot of code on the local linear algebra operations using Spark's Matrix and Vector representations done for https://issues.apache.org/jira/browse/SPARK-6442. I can make a Spark package with that code if people are interested. Best, Burak On Tue, Aug 25, 2015 at 10:54 AM, Kristina R

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
> > Best, > Roberto > > > On Mon, Aug 24, 2015 at 7:38 PM, Burak Yavuz wrote: > >> textFile is a lazy operation. It doesn't evaluate until you call an >> action on it, such as .count(). Therefore, you won't catch the exception >> there. >>

Re: Unable to catch SparkContext methods exceptions

2015-08-24 Thread Burak Yavuz
textFile is a lazy operation. It doesn't evaluate until you call an action on it, such as .count(). Therefore, you won't catch the exception there. Best, Burak On Mon, Aug 24, 2015 at 9:09 AM, Roberto Coluccio < roberto.coluc...@gmail.com> wrote: > Hello folks, > > I'm experiencing an unexpected

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-21 Thread Burak Yavuz
Thursday 20 August 2015 07:50 PM, Burak Yavuz wrote: > > Matrix.toBreeze is a private method. MLlib matrices have the same > structure as Breeze Matrices. Just create a new Breeze matrix like this > <https://github.com/apache/spark/blob/43e0135421b2262cbb0e06aae53523f663b4f959/ml

Re: Creating Spark DataFrame from large pandas DataFrame

2015-08-20 Thread Burak Yavuz
If you would like to try using spark-csv, please use `pyspark --packages com.databricks:spark-csv_2.11:1.2.0` You're missing a dependency. Best, Burak On Thu, Aug 20, 2015 at 1:08 PM, Charlie Hack wrote: > Hi, > > I'm new to spark and am trying to create a Spark df from a pandas df with > ~5 m

Re: Convert mllib.linalg.Matrix to Breeze

2015-08-20 Thread Burak Yavuz
Matrix.toBreeze is a private method. MLlib matrices have the same structure as Breeze Matrices. Just create a new Breeze matrix like this. Best, B

Re: Unit Testing

2015-08-13 Thread Burak Yavuz
I would recommend this spark package for your unit testing needs ( http://spark-packages.org/package/holdenk/spark-testing-base). Best, Burak On Thu, Aug 13, 2015 at 5:51 AM, jay vyas wrote: > yes there certainly is, so long as eclipse has the right plugins and so on > to run scala programs. Y

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
In addition, you do not need to use --jars with --packages. --packages will get the jar for you. Best, Burak On Mon, Aug 3, 2015 at 9:01 AM, Burak Yavuz wrote: > Hi, there was this issue for Scala 2.11. > https://issues.apache.org/jira/browse/SPARK-7944 > It should be fixed on mast

Re: Cannot Import Package (spark-csv)

2015-08-03 Thread Burak Yavuz
Hi, there was this issue for Scala 2.11. https://issues.apache.org/jira/browse/SPARK-7944 It should be fixed on master branch. You may be hitting that. Best, Burak On Sun, Aug 2, 2015 at 9:06 PM, Ted Yu wrote: > I tried the following command on master branch: > bin/spark-shell --packages com.da

Re: Which directory contains third party libraries for Spark

2015-07-28 Thread Burak Yavuz
Hey Stephen, In case these libraries exist on the client as a form of maven library, you can use --packages to ship the library and all it's dependencies, without building an uber jar. Best, Burak On Tue, Jul 28, 2015 at 10:23 AM, Marcelo Vanzin wrote: > Hi Stephen, > > There is no such direct

Re: How to unpersist RDDs generated by ALS/MatrixFactorizationModel

2015-07-22 Thread Burak Yavuz
Hi Jonathan, I believe calling persist with StorageLevel.NONE doesn't do anything. That's why the unpersist has an if statement before it. Could you give more information about your setup please? Number of cores, memory, number of partitions of ratings_train? Thanks, Burak On Wed, Jul 22, 2015 a

Re: RowId unique key for Dataframes

2015-07-21 Thread Burak Yavuz
Would monotonicallyIncreasingId work for you? Best, Burak On Tue, Jul 21, 2015 at 4:55 PM, Srikanth wrote: > Hello, > > I'm creating dataframes fro
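
For reference, roughly (the function is monotonicallyIncreasingId in the 1.x API and monotonically_increasing_id in later releases):

    import org.apache.spark.sql.functions.monotonically_increasing_id

    // IDs are unique and increasing, but not consecutive.
    val withId = df.withColumn("rowId", monotonically_increasing_id())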

Re: LinearRegressionWithSGD Outputs NaN

2015-07-21 Thread Burak Yavuz
Hi, Could you please decrease your step size to 0.1, and also try 0.01? You could also try running L-BFGS, which doesn't have step size tuning, to get better results. Best, Burak On Tue, Jul 21, 2015 at 2:59 AM, Naveen wrote: > Hi , > > I am trying to use LinearRegressionWithSGD on Million Song

Re: Running mllib from R in Spark 1.4

2015-07-15 Thread Burak Yavuz
Hi, There is no MLlib support in SparkR in 1.4. There will be some support in 1.5. You can check these JIRAs for progress: https://issues.apache.org/jira/browse/SPARK-6805 https://issues.apache.org/jira/browse/SPARK-6823 Best, Burak On Wed, Jul 15, 2015 at 6:00 AM, madhu phatak wrote: > Hi, > I

Re: MLlib LogisticRegressionWithLBFGS error

2015-07-14 Thread Burak Yavuz
Hi, Is this in LibSVM format? If so, the indices should be sorted in increasing order. It seems like they are not sorted. Best, Burak On Tue, Jul 14, 2015 at 7:31 PM, Vi Ngo Van wrote: > Hi All, > I've met a issue with MLlib when i use LogisticRegressionWithLBFGS > > my sample data : > > *0 86

Re: creating a distributed index

2015-07-14 Thread Burak Yavuz
Hi Swetha, IndexedRDD is available as a package on Spark Packages. Best, Burak On Tue, Jul 14, 2015 at 5:23 PM, swetha wrote: > Hi Ankur, > > Is IndexedRDD available in Spark 1.4.0? We would like to use this in Spark > Streaming to do

Re: Strange behavoir of pyspark with --jars option

2015-07-14 Thread Burak Yavuz
Hi, I believe the HiveContext uses a different class loader. It then falls back to the system class loader if it can't find the classes in the context class loader. The system class loader contains the classpath passed through --driver-class-path and spark.executor.extraClassPath. The JVM is alread

Re: To access elements of a org.apache.spark.mllib.linalg.Vector

2015-07-14 Thread Burak Yavuz
Hi Dan, You could zip the indices with the values if you like. ``` val sVec = sparseVector(1).asInstanceOf[ org.apache.spark.mllib.linalg.SparseVector] val map = sVec.indices.zip(sVec.values).toMap ``` Best, Burak On Tue, Jul 14, 2015 at 12:23 PM, Dan Dong wrote: > Hi, > I'm wondering how t

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
irmal Fernando wrote: > I'm using; > > org.apache.spark.mllib.clustering.KMeans.train(data.rdd(), 3, 20); > > Cpu cores: 8 (using default Spark conf thought) > > On partitions, I'm not sure how to find that. > > On Mon, Jul 13, 2015 at 11:30 PM, Burak Yavuz

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
1.4 > > On Mon, Jul 13, 2015 at 10:28 PM, Burak Yavuz wrote: > >> Hi, >> >> How are you running K-Means? What is your k? What is the dimension of >> your dataset (columns)? Which Spark version are you using? >> >> Thanks, >> Burak >&g

Re: [MLLib][Kmeans] KMeansModel.computeCost takes lot of time

2015-07-13 Thread Burak Yavuz
Hi, How are you running K-Means? What is your k? What is the dimension of your dataset (columns)? Which Spark version are you using? Thanks, Burak On Mon, Jul 13, 2015 at 2:53 AM, Nirmal Fernando wrote: > Hi, > > For a fairly large dataset, 30MB, KMeansModel.computeCost takes lot of > time (16

Re: Unit tests of spark application

2015-07-10 Thread Burak Yavuz
I can +1 Holden's spark-testing-base package. Burak On Fri, Jul 10, 2015 at 12:23 PM, Holden Karau wrote: > Somewhat biased of course, but you can also use spark-testing-base from > spark-packages.org as a basis for your unittests. > > On Fri, Jul 10, 2015 at 12:03 PM, Daniel Siegmann < > danie

Re: How to ignore features in mllib

2015-07-09 Thread Burak Yavuz
If you use the Pipelines Api with DataFrames, you select which columns you would like to train on using the VectorAssembler. While using the VectorAssembler, you can choose not to select some features if you like. Best, Burak On Thu, Jul 9, 2015 at 10:38 AM, Arun Luthra wrote: > Is it possible
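
A sketch of leaving a feature out via VectorAssembler (column names are hypothetical):

    import org.apache.spark.ml.feature.VectorAssembler

    val assembler = new VectorAssembler()
      .setInputCols(Array("f1", "f2", "f4"))  // "f3" deliberately excluded from the feature vector
      .setOutputCol("features")

    val prepared = assembler.transform(df)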

Re: spark-submit can not resolve spark-hive_2.10

2015-07-07 Thread Burak Yavuz
spark-hive is excluded when using --packages, because it can be included in the spark-assembly by adding -Phive during mvn package or sbt assembly. Best, Burak On Tue, Jul 7, 2015 at 8:06 AM, Hao Ren wrote: > I want to add spark-hive as a dependence to submit my job, but it seems > that > spark

Re: Spark 1.4 MLLib Bug?: Multiclass Classification "requirement failed: sizeInBytes was negative"

2015-07-03 Thread Burak Yavuz
How many partitions do you have? It might be that one partition is too large, and there is Integer overflow. Could you double your number of partitions? Burak On Fri, Jul 3, 2015 at 4:41 AM, Danny wrote: > hi, > > i want to run a multiclass classification with 390 classes on 120k label > points(

Re: coalesce on dataFrame

2015-07-01 Thread Burak Yavuz
You can use df.repartition(1) in Spark 1.4. See here. Best, Burak On Wed, Jul 1, 2015 at 3:05 AM, Olivier Girardot wrote: > PySpark or Spark (scala) ? > When you use coalesce with a

Re: breeze.linalg.DenseMatrix not found

2015-06-30 Thread Burak Yavuz
How does your build file look? Are you possibly using wrong Scala versions? Have you added Breeze as a dependency to your project? If so which version? Thanks, Burak On Mon, Jun 29, 2015 at 3:45 PM, AlexG wrote: > I get the same error even when I define covOperator not to use a matrix at > all:

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-30 Thread Burak Yavuz
.hbase:hbase:1.1.1, junit:junit:x --repositories http://some.other.repo,http://some.other.repo2 $YOUR_JAR Best, Burak On Mon, Jun 29, 2015 at 11:33 PM, SLiZn Liu wrote: > Hi Burak, > > Is `--package` flag only available for maven, no sbt support? > > On Tue, Jun 30, 2015 at 2:26

Re: Can Dependencies Be Resolved on Spark Cluster?

2015-06-29 Thread Burak Yavuz
You can pass `--packages your:comma-separated:maven-dependencies` to spark submit if you have Spark 1.3 or greater. Best regards, Burak On Mon, Jun 29, 2015 at 10:46 PM, SLiZn Liu wrote: > Hey Spark Users, > > I'm writing a demo with Spark and HBase. What I've done is packaging a > **fat jar**:

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
ould restarted the transformation ended up updating accumulator more than > once? > > Best, > Wei > > 2015-06-24 13:23 GMT-07:00 Burak Yavuz : > >> Hi Wei, >> >> For example, when a straggler executor gets killed in the middle of a map >> operation

Re: Understanding accumulator during transformations

2015-06-24 Thread Burak Yavuz
Hi Wei, For example, when a straggler executor gets killed in the middle of a map operation and it's task is restarted at a different instance, the accumulator will be updated more than once. Best, Burak On Wed, Jun 24, 2015 at 1:08 PM, Wei Zhou wrote: > Quoting from Spark Program guide: > > "

Re: Confusion matrix for binary classification

2015-06-22 Thread Burak Yavuz
Hi, In Spark 1.4, you may use DataFrame.stat.crosstab to generate the confusion matrix. This would be very simple if you are using the ML Pipelines Api, and are working with DataFrames. Best, Burak On Mon, Jun 22, 2015 at 4:21 AM, CD Athuraliya wrote: > Hi, > > I am looking for a way to get co
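
In code, roughly (column names are assumptions):

    val confusion = predictions.stat.crosstab("prediction", "label")
    confusion.show()  // rows are prediction values, columns are label values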

Re: SparkSubmit with Ivy jars is very slow to load with no internet access

2015-06-18 Thread Burak Yavuz
Hey Nathan, I like the first idea better. Let's see what others think. I'd be happy to review your PR afterwards! Best, Burak On Thu, Jun 18, 2015 at 9:53 PM, Nathan McCarthy < nathan.mccar...@quantium.com.au> wrote: > Hey, > > Spark Submit adds maven central & spark bintray to the ChainResol

Re: --packages & Failed to load class for data source v1.4

2015-06-14 Thread Burak Yavuz
Hi Don, This seems related to a known issue, where the classpath on the driver is missing the related classes. This is a bug in py4j as py4j uses the System Classloader rather than Spark's Context Classloader. However, this problem existed in 1.3.0 as well, therefore I'm curious whether it's the sa

Re: How to read avro in SparkR

2015-06-13 Thread Burak Yavuz
Hi, Not sure if this is it, but could you please try "com.databricks.spark.avro" instead of just "avro". Thanks, Burak On Jun 13, 2015 9:55 AM, "Shing Hing Man" wrote: > Hi, > I am trying to read a avro file in SparkR (in Spark 1.4.0). > > I started R using the following. > matmsh@gauss:~$ spa

Re: foreach plus accumulator Vs mapPartitions performance

2015-05-21 Thread Burak Yavuz
Or you can simply use `reduceByKeyLocally` if you don't want to worry about implementing accumulators and such, and assuming that the reduced values will fit in memory of the driver (which you are assuming by using accumulators). Best, Burak On Thu, May 21, 2015 at 2:46 PM, ben wrote: > Hi, eve
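
A minimal sketch (the key extraction is a placeholder); the result is an ordinary Map on the driver:

    val counts: scala.collection.Map[String, Long] =
      rdd.map(record => (record.key, 1L))  // hypothetical key field
        .reduceByKeyLocally(_ + _)         // merges per-partition maps on the driver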
