Fw:Re:Re: Spark1.4.0 compiling error with java1.6.0_20: sun.misc.Unsafe cannot be applied to (java.lang.Object,long,java.lang.Object,long,long)

2015-06-25 Thread Young
+all user Forwarding messages From: "Young" Date: 2015-06-26 10:31:19 To: "Ted Yu" Subject: Re:Re: Spark1.4.0 compiling error with java1.6.0_20: sun.misc.Unsafe cannot be applied to (java.lang.Object,long,java.lang.Object,long,long) Thanks for rep

spark sql thrift server: driver OOM

2016-09-20 Thread Young
ntrySet So, how can I make my Spark SQL thrift server run longer, other than requesting more driver memory? Does anyone else encounter the same situation? Sincerely, Young - To unsubscribe e

Why fetchSize should be bigger than 0 in JDBCOptions.scala?

2018-03-30 Thread Young
My executor OOMs when using spark-sql to read data from MySQL. In sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala, I see the following lines. I'm wondering why JDBC_BATCH_FETCH_SIZE should be bigger than 0? val fetchSize = { val size = parameters.ge
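For reference, a minimal sketch of how the fetch size is passed to the JDBC reader (assuming a SparkSession named spark; the URL, table, and credentials below are placeholders):

    // Rows fetched per round trip are controlled by the "fetchsize" option,
    // which is the JDBC_BATCH_FETCH_SIZE key validated in JDBCOptions.scala.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/mydb")
      .option("dbtable", "big_table")
      .option("user", "reader")
      .option("password", "secret")
      .option("fetchsize", "10000")
      .load()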

[Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-11 Thread Alana Young
I am experimenting with creating and persisting ML pipelines using custom transformers (I am using Spark 3.1.2). I was able to create a transformer class (for testing purposes, I modeled the code off the SQLTransformer class) and save the pipeline model. When I attempt to load the saved pipeline
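For context, one pattern that works for simple, param-only custom transformers in Scala is to mix in DefaultParamsWritable and give the class a DefaultParamsReadable companion object, so that PipelineModel.load can locate a reader for the class. A minimal sketch (the class name and no-op transform are made up for illustration):

    import org.apache.spark.ml.Transformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
    import org.apache.spark.sql.{DataFrame, Dataset}
    import org.apache.spark.sql.types.StructType

    class MyTransformer(override val uid: String) extends Transformer with DefaultParamsWritable {
      def this() = this(Identifiable.randomUID("myTransformer"))
      // No-op transform, purely for illustration.
      override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
      override def transformSchema(schema: StructType): StructType = schema
      override def copy(extra: ParamMap): MyTransformer = defaultCopy(extra)
    }

    // The companion object is what load() uses to find a reader for this class.
    object MyTransformer extends DefaultParamsReadable[MyTransformer]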

RE: Re: [Spark ML Pipeline]: Error Loading Pipeline Model with Custom Transformer

2022-01-12 Thread Alana Young
I have updated the gist (https://gist.github.com/ally1221/5acddd9650de3dc67f6399a4687893aa ). Please let me know if there are any additional questions.

Spark Unary Transformer Example

2022-01-13 Thread Alana Young
I am trying to run the Unary Transformer example provided by Spark (https://github.com/apache/spark/blob/v3.1.2/examples/src/main/scala/org/apache/spark/examples/ml/UnaryTransformerExample.scala

Re: StackOverflow in Spark

2016-06-01 Thread Matthew Young
Hi, It's related to a fixed bug in Spark, JIRA ticket SPARK-6847. Matthew Yang On Wed, May 25, 2016 at 7:48 PM, Michel Hubert wrote: > > > Hi, > > > > > > I have a Spark application which generates StackOverflowError exceptions > after 3

Re: Spark sql insert hive table which method has the highest performance

2019-05-15 Thread Jelly Young
Hi, The documentation for DataFrameWriter says: Unlike `saveAsTable`, `insertInto` ignores the column names and just uses position-based resolution. For example: scala> Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1") scala> Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
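A short sketch of the position-based behavior (spark-shell session assumed, so implicits are in scope; the table and column names are illustrative):

    // saveAsTable matches columns by name; insertInto matches strictly by position.
    Seq((1, 2)).toDF("i", "j").write.mode("overwrite").saveAsTable("t1")
    // Despite the swapped names, 3 lands in column i and 4 in column j:
    Seq((3, 4)).toDF("j", "i").write.insertInto("t1")
    spark.sql("select * from t1").show()
    // expected rows (in some order): (1, 2) and (3, 4) under columns i, j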

Spark streaming error in block pushing thread

2015-04-02 Thread Bill Young
> org.apache.spark.streaming.receiver.BlockGenerator.org$apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) > at > org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) Has anyone run into this before? -- Bill Young Threat Stack | Infrastructure Engineer http://www.threatstack.com

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
or.scala:182) >> at >> org.apache.spark.streaming.receiver.BlockGenerator.org >> $apache$spark$streaming$receiver$BlockGenerator$$keepPushingBlocks(BlockGenerator.scala:155) >> at >> >> org.apache.spark.streaming.receiver.BlockGenerator$$anon$1.run(BlockGenerator.scala:87) >> >> >> Has anyone run into this before? >> >> >> >> -- >> View this message in context: >> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Error-in-block-pushing-thread-tp22356.html >> Sent from the Apache Spark User List mailing list archive at Nabble.com. >> >> - >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org >> For additional commands, e-mail: user-h...@spark.apache.org >> >> > -- -- Bill Young Threat Stack | Senior Infrastructure Engineer http://www.threatstack.com

Re: Spark Streaming Error in block pushing thread

2015-04-02 Thread Bill Young
Sorry for the obvious typo, I have 4 workers with 16 cores total* On Thu, Apr 2, 2015 at 11:56 AM, Bill Young wrote: > Thank you for the response, Dean. There are 2 worker nodes, with 8 cores > total, attached to the stream. I have the following settings applied: > > spark.exe

Processing Large Images in Spark?

2015-04-06 Thread Patrick Young
Hi all, I'm new to Spark and wondering if it's appropriate to use for some image processing tasks on pretty sizable (~1 GB) images. Here is an example use case. Amazon recently put the entire Landsat8 archive in S3: http://aws.amazon.com/public-data-sets/landsat/ I have a bunch of GDAL based (

Spark and SQL Server

2015-07-17 Thread Young, Matthew T
ciated. Thank you for your time, -- Matthew Young - To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org

Setting the vote rate in a Random Forest in MLlib

2015-12-16 Thread Young, Matthew T
One of our data scientists is interested in using Spark to improve performance in some random forest binary classifications, but isn't getting good enough results from MLlib's random forest implementation, compared to R's randomForest library, with the available parameters. She suggested th

RE: save DF to JDBC

2015-10-05 Thread Young, Matthew T
I’ve gotten it to work with SQL Server (with limitations; it’s buggy and doesn’t work with some types/operations). https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/DataFrameWriter.html is the Java API you are looking for; the JDBC method lets you write to JDBC databases. I ha
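For reference, a minimal sketch of writing a DataFrame to SQL Server through the DataFrameWriter JDBC method (df stands in for an existing DataFrame; the URL, table, and credentials are placeholders, and the Microsoft JDBC driver jar must be on the classpath):

    import java.util.Properties

    // Connection properties for the target database.
    val props = new Properties()
    props.setProperty("user", "spark_user")
    props.setProperty("password", "secret")
    props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

    // Write the DataFrame to the given table over JDBC.
    df.write.jdbc("jdbc:sqlserver://db-host:1433;databaseName=mydb", "dbo.results", props)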

RE: Pulling data from a secured SQL database

2015-10-30 Thread Young, Matthew T
> Can the driver pull data and then distribute execution? Yes, as long as your dataset will fit in the driver's memory. Execute arbitrary code to read the data on the driver as you normally would if you were writing a single-node application. Once you have the data in a collection on the drive
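A minimal sketch of that pattern, assuming a SparkContext named sc and a small result set (the connection string and query are placeholders):

    import java.sql.DriverManager
    import scala.collection.mutable.ArrayBuffer

    // Read on the driver exactly as a single-node application would.
    val conn = DriverManager.getConnection(
      "jdbc:sqlserver://db-host:1433;databaseName=mydb", "user", "secret")
    val rows = new ArrayBuffer[(Int, String)]()
    val rs = conn.createStatement().executeQuery("SELECT id, name FROM small_table")
    while (rs.next()) rows += ((rs.getInt("id"), rs.getString("name")))
    conn.close()

    // Distribute the in-memory collection to the executors for parallel processing.
    val rdd = sc.parallelize(rows)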

RE: Very slow performance on very small record counts

2015-11-03 Thread Young, Matthew T
stream API. From: Cody Koeninger [mailto:c...@koeninger.org] Sent: Saturday, October 31, 2015 2:00 PM To: YOUNG, MATTHEW, T (Intel Corp) Subject: Re: Very slow performance on very small record counts Have you looked at jstack or the thread dump from the spark ui during that time to see what

RE: How can you sort wordcounts by counts in stateful_network_wordcount.py example

2015-11-12 Thread Young, Matthew T
You can use foreachRDD to get access to the batch API in streaming jobs. From: Amir Rahnama [mailto:amirrahn...@gmail.com] Sent: Thursday, November 12, 2015 12:11 AM To: ayan guha Cc: user Subj
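A minimal sketch of that approach, where wordCounts stands in for the stateful DStream of (word, count) pairs from the example:

    // foreachRDD exposes the batch RDD API for each micro-batch,
    // so the pairs can be sorted by count before being emitted.
    wordCounts.foreachRDD { rdd =>
      val top = rdd.sortBy(_._2, ascending = false).take(10)
      top.foreach(println)
    }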

RE: capture video with spark streaming

2015-11-30 Thread Young, Matthew T
Unless it’s a network camera with the ability to request specific frames for reading, the answer is that you read from the camera as you normally would without Spark, inside a foreachRDD(), and parallelize the result out for processing once you have it in a collection in the dri

RE: is repartition very cost

2015-12-08 Thread Young, Matthew T
Shuffling large amounts of data over the network is expensive, yes. The cost is lower if you are just using a single node where no networking needs to be involved to do the repartition (using Spark as a multithreading engine). In general you need to do performance testing to see if a repartition
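As a reference point (df is a placeholder DataFrame), the two partition-count operations have different costs:

    // repartition() always performs a full shuffle and redistributes data evenly.
    val wide = df.repartition(200)
    // coalesce() merges existing partitions and can avoid a full shuffle when reducing the count.
    val narrow = df.coalesce(10)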

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
pi/python/pyspark.sql.html#pyspark.sql.DataFrameWriter>. I will keep this distinction in mind going forward. I guess we have to wait for Microsoft to release an SQL Server connector for Spark to resolve the other issues. Cheers, -- Matthew Young From: Dav

RE: Spark and SQL Server

2015-07-20 Thread Young, Matthew T
avies Liu [dav...@databricks.com] Sent: Monday, July 20, 2015 9:08 AM To: Young, Matthew T Cc: user@spark.apache.org Subject: Re: Spark and SQL Server Sorry for the confusing. What's the other issues? On Mon, Jul 20, 2015 at 8:26 AM, Young, Matthew T wrote: > Thanks Davies, that resolves t

RE: Would driver shutdown cause app dead?

2015-07-21 Thread Young, Matthew T
ZhuGe, If you run your program in the "cluster" deploy-mode you get resiliency against driver failure, though there are some steps you have to take in how you write your streaming job to allow for transparent resume. Netflix did a nice writeup of this resiliency here
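A minimal sketch of the "transparent resume" pattern for a streaming job: recreate the StreamingContext from a checkpoint directory if one exists (the checkpoint path, app name, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-stream"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("resilient-stream")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // define the DStream graph here, before returning, so it can be recovered
      ssc
    }

    // On restart, the context (and DStream graph) is rebuilt from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()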

Issue with column named "count" in a DataFrame

2015-07-22 Thread Young, Matthew T
.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Unknown Source) Is there a recommended workaround to the inability to filter on a column named count? Do I have to make a new DataFrame and rename the c

RE: Issue with column named "count" in a DataFrame

2015-07-23 Thread Young, Matthew T
015 4:26 PM To: Young, Matthew T Cc: user@spark.apache.org Subject: Re: Issue with column named "count" in a DataFrame Additionally have you tried enclosing count in `backticks`? On Wed, Jul 22, 2015 at 4:25 PM, Michael Armbrust <mich...@databricks.com> wrote: I believe
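A short sketch of the two workarounds mentioned here (df is a placeholder DataFrame with a column literally named "count"):

    // Escape the reserved-looking name with backticks inside a SQL expression string.
    df.filter("`count` > 10").show()
    // Or sidestep the SQL parser entirely by using the Column API.
    df.filter(df("count") > 10).show()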

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
The built-in Spark JSON functionality cannot read normal JSON arrays. The format it expects is a bunch of individual JSON objects without any outer array syntax, with one complete JSON object per line of the input file. AFAIK your options are to read the JSON in the driver and parallelize it out
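To make the expected format concrete, the reader wants newline-delimited JSON, one complete object per line:

    {"id": 1, "name": "a"}
    {"id": 2, "name": "b"}

And a hedged sketch of the driver-side workaround for a file that is a single top-level array (assuming sqlContext and sc are in scope; the path is a placeholder, and the naive split is only safe for flat objects with no nested braces):

    import scala.io.Source

    // Read the whole array on the driver, split it into per-object strings,
    // then parallelize and hand the strings to the JSON reader.
    val raw = Source.fromFile("/tmp/data.json").mkString.trim
    val objects = raw.stripPrefix("[").stripSuffix("]").split("(?<=\\}),(?=\\s*\\{)").map(_.trim)
    val jsonDF = sqlContext.read.json(sc.parallelize(objects))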

RE: How to read a Json file with a specific format?

2015-07-29 Thread Young, Matthew T
:"30","Nrout":"0","up":null,"Crate":"2"},{"MLrate":"30","Nrout":"0","up":null,"Crate":"2"}]} {"IFAM":"EQR","KTM":143000640,"COL

RE: IP2Location within spark jobs

2015-07-29 Thread Young, Matthew T
You can put the database files in a central location accessible to all the workers and build the GeoIP object once per-partition when you go to do a mapPartitions across your dataset, loading from the central location. ___ From: Filli Alem [alem.fi...@ti8m.ch] Sent: Wednesday, July 29, 2015
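A minimal sketch of building the lookup object once per partition (GeoIpDatabase and lookupCountry are hypothetical stand-ins for whatever GeoIP library is in use, and the shared path is a placeholder):

    // ipAddresses: RDD[String] of IPs to enrich.
    val enriched = ipAddresses.mapPartitions { ips =>
      // Opened once per partition, from the shared location, rather than once per record.
      val db = new GeoIpDatabase("/mnt/shared/GeoIP.dat")
      ips.map(ip => (ip, db.lookupCountry(ip)))
    }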

Getting number of physical machines in Spark

2015-08-27 Thread Young, Matthew T
What's the canonical way to find out the number of physical machines in a cluster at runtime in Spark? I believe SparkContext.defaultParallelism will give me the number of cores, but I'm interested in the number of NICs. I'm writing a Spark streaming application to ingest from Kafka with the Re
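One possible, unofficial approximation (assuming a SparkContext named sc): getExecutorMemoryStatus is keyed by "host:port", so counting distinct hosts gives a rough machine count. It reflects registered executors (plus the driver), not NICs, so treat it as an estimate only:

    // Count the distinct hosts currently known to the SparkContext.
    val hostCount = sc.getExecutorMemoryStatus.keys.map(_.split(":")(0)).toSet.size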

Reasonable performance numbers?

2015-09-24 Thread Young, Matthew T
Hello, I am doing performance testing with Spark Streaming. I want to know if the throughput numbers I am encountering are reasonable for the power of my cluster and Spark's performance characteristics. My job has the following processing steps: 1. Read 600 Byte JSON strings from a 7 brok