Re: [Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-19 Thread Eric Hanchrow
We’ve discovered a workaround for this; it’s described at https://issues.apache.org/jira/browse/HADOOP-18521. From: Eric Hanchrow Date: Thursday, December 8, 2022 at 17:03 To: user@spark.apache.org Subject: [Spark SQL]: unpredictable errors: java.io.IOException: can not read

[Spark SQL]: unpredictable errors: java.io.IOException: can not read class org.apache.parquet.format.PageHeader

2022-12-08 Thread Eric Hanchrow
My company runs java code that uses Spark to read from, and write to, Azure Blob storage. This code runs more or less 24x7. Recently we've noticed a few failures that leave stack traces in our logs; what they have in common are exceptions that look variously like Caused by: java.io.IOExcep

Re: Naming files while saving a Dataframe

2021-08-12 Thread Eric Beabes
pends on Hadoop writing files. You can try to set the > Hadoop property: mapreduce.output.basename > > > https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration-- > > > On 18.07.2021 at 01:15, Eric Beabes wrote: > >  >
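
A minimal sketch of the suggestion quoted above, assuming the property is set on the shared SparkContext before the write; whether the Parquet committer in use actually honors it is not guaranteed, and the "jobA" prefix and output path are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("file-naming-sketch").getOrCreate()
    // Ask the Hadoop output layer to prefix output file names for this job.
    spark.sparkContext.hadoopConfiguration.set("mapreduce.output.basename", "jobA")

    val df = spark.range(10).toDF("value")   // stand-in for the real job's output
    df.write.mode("overwrite").parquet("s3://bucket/shared-output")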

Replacing BroadcastNestedLoopJoin

2021-08-12 Thread Eric Beabes
We’ve two datasets that look like this: Dataset A: App specific data that contains (among other fields): ip_address Dataset B: Location data that contains start_ip_address_int, end_ip_address_int, latitude, longitude We’re (left) joining these two datasets as: A.ip_address >= B.start_ip_address
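
For reference, a minimal sketch of the non-equi join being described (column names taken from the post; dsA and dsB are assumed DataFrames). A pure range predicate like this is what typically forces Spark into a BroadcastNestedLoopJoin, because there is no equality key to hash or sort-merge on:

    // dsA: app data with a numeric ip_address; dsB: location ranges.
    val joined = dsA.join(
      dsB,
      dsA("ip_address") >= dsB("start_ip_address_int") &&
        dsA("ip_address") <= dsB("end_ip_address_int"),
      "left")

A common general workaround (not taken from this thread) is to derive a coarse equality key on both sides, for example an IP prefix bucket, join on that key first, and then apply the range filter.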

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such loss,

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
; > > > *Disclaimer:* Use it at your own risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
yan guha wrote: > IMHO - this is a bad idea esp in failure scenarios. > > How about creating a subfolder each for the jobs? > > On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes > wrote: > >> We've two (or more) jobs that write data into the same directory via a >> Da

Naming files while saving a Dataframe

2021-07-16 Thread Eric Beabes
We've two (or more) jobs that write data into the same directory via a Dataframe.save method. We need to be able to figure out which job wrote which file. Maybe provide a 'prefix' to the file names. I was wondering if there's any 'option' that allows us to do this. Googling didn't come up with any

Re: CVEs

2021-07-12 Thread Eric Richardson
out. Thanks, Eric On Mon, Jun 21, 2021 at 5:45 PM Eric Richardson wrote: > Ok, that sounds like a plan. I will gather what I found and either reach > out on the security channel and/or try and upgrade with a pull request. > > Thanks for pointing me in the right direction. > &

Re: Unsubscribe

2021-07-11 Thread Eric Wang
Unsubscribe On Sun, Jul 11, 2021 at 9:59 PM Rishi Raj Tandon wrote: > Unsubscribe >

Re: CVEs

2021-06-21 Thread Eric Richardson
>> a valid vulnerability the best path forward is likely reaching out to >> private@ to figure out how to do a security release. >> >> On Mon, Jun 21, 2021 at 4:42 PM Eric Richardson >> wrote: >> >>> Thanks for the quick reply. Yes, since it is included

Re: CVEs

2021-06-21 Thread Eric Richardson
ly their own Jackson. > > If someone had a legit view that this is potentially more serious I think > we could _probably backport that update, but Jackson can be a little bit > tricky with compatibility IIRC so would just bear some testing. > > > On Mon, Jun 21, 2021 at 5:27 PM

CVEs

2021-06-21 Thread Eric Richardson
://github.com/FasterXML/jackson-databind/issues/2589 - but Spark supplies 2.10.0. Thanks, Eric

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
ay be significant. But it seems like the > simplest thing and will probably work fine. > > On Tue, May 25, 2021 at 4:34 PM Eric Beabes > wrote: > >> Right... but the problem is still the same, no? Those N Jobs (aka Futures >> or Threads) will all be running on the Driver.

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
arquet, for > example. You would just have 10s or 100s of those jobs running at the same > time. You have to write a bit of async code to do it, but it's pretty easy > with Scala Futures. > > On Tue, May 25, 2021 at 3:31 PM Eric Beabes > wrote: > >> Here's
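
A rough sketch of the Futures approach quoted above, assuming each directory can be processed independently; spark, listOfPaths, and the per-path work are placeholders:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration

    // Launch one Spark job per directory from the driver; the jobs run
    // concurrently on the cluster instead of one after another.
    val jobs: Seq[Future[Long]] = listOfPaths.map { path =>
      Future {
        spark.read.parquet(path).count()   // stand-in for the real per-directory work
      }
    }
    val counts = Await.result(Future.sequence(jobs), Duration.Inf)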

Re: Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
> > val df = spark.read.option(“mergeSchema”, “true”).load(listOfPaths) > > > > *From: *Eric Beabes > *Date: *Tuesday, May 25, 2021 at 1:24 PM > *To: *spark-user > *Subject: *Reading parquet files in parallel on the cluster > > > > I've a use case in which
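
The single-job alternative quoted above, written out as a small sketch (listOfPaths is assumed to be a Seq[String] of directory paths):

    // One job reads every directory at once; Spark parallelizes the file
    // reads across the executors, so no driver-side threading is needed.
    val df = spark.read
      .option("mergeSchema", "true")
      .parquet(listOfPaths: _*)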

Reading parquet files in parallel on the cluster

2021-05-25 Thread Eric Beabes
I've a use case in which I need to read Parquet files in parallel from over 1000+ directories. I am doing something like this: val df = list.toList.toDF() df.foreach(c => { val config = *getConfigs()* doSomething(spark, config) }) In the doSomething method, when I try to

NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread Eric Beabes
I keep getting the following exception when I am trying to read a Parquet file from a Path on S3 in Spark/Scala. Note: I am running this on EMR. java.lang.NullPointerException at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144) at org.apache.spark

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
no case be liable for any monetary damages >> arising from such loss, damage or destruction. >> >> >> >> >> ‪On Mon, 3 May 2021 at 18:27, ‫"Yuri Oleynikov (‫יורי אולייניקוב‬‎)"‬‎ < >> yur...@gmail.com> wrote:‬ >> >>> You

Re: Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
wn risk. Any and all responsibility for any > loss, damage or destruction of data or any other property which may arise > from relying on this email's technical content is explicitly disclaimed. > The author will in no case be liable for any monetary damages arising from > such lo

Stream which needs to be “joined” with another Stream of “Reference” data.

2021-05-03 Thread Eric Beabes
I would like to develop a Spark Structured Streaming job that reads messages in a Stream which needs to be “joined” with another Stream of “Reference” data. For example, let’s say I’m reading messages from Kafka coming in from (lots of) IOT devices. This message has a ‘device_id’. We have a DEVICE
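
A very rough sketch of the pattern being described, assuming the reference data can be treated as a static DataFrame (a true stream-stream join is also possible in Spark 2.3+ but requires watermarks on both sides); the broker, topic, paths, and schema below are placeholders:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val eventSchema = new StructType()
      .add("device_id", StringType)
      .add("reading", DoubleType)

    // Stream of IOT messages from Kafka, parsed into columns.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "device-events")
      .load()
      .select(from_json(col("value").cast("string"), eventSchema).as("e"))
      .select("e.*")

    // DEVICE reference data, read here as a plain batch DataFrame.
    val devices = spark.read.parquet("s3://bucket/device-reference")

    // Enrich each incoming message with its device record.
    val enriched = events.join(devices, Seq("device_id"), "left")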

Spark doesn't add _SUCCESS file when 'partitionBy' is used

2021-04-05 Thread Eric Beabes
When I do the following, Spark( 2.4) doesn't put _SUCCESS file in the partition directory: val outputPath = s"s3://mybucket/$table" df .orderBy(time) .coalesce(numFiles) .write .partitionBy("partitionDate") .mode("overwrite") .format("parquet") .save(outputPath) But when I remove 'partitionBy'
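
One thing worth checking, offered as an assumption rather than a confirmed explanation: the _SUCCESS marker is written by the Hadoop FileOutputCommitter at the root of the output path (never inside partition directories), and only while the corresponding flag is enabled:

    // Make sure the job-success marker has not been disabled in the Hadoop configuration.
    spark.sparkContext.hadoopConfiguration
      .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "true")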

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Eric Beabes
fter a long time. Some memory leak in your app > putting GC/memory pressure on the JVM, etc too. > > On Thu, Jan 21, 2021 at 5:13 AM Eric Beabes > wrote: > >> Hello, >> >> My Spark Structured Streaming application was performing well for quite >> some ti

Re: Only one Active task in Spark Structured Streaming application

2021-01-21 Thread Eric Beabes
t; >> https://about.me/JacekLaskowski >> "The Internals Of" Online Books <https://books.japila.pl/> >> Follow me on https://twitter.com/jaceklaskowski >> >> <https://twitter.com/jaceklaskowski> >> >> >> On Thu, Jan 21, 2021

Re: Data source v2 streaming sinks does not support Update mode

2021-01-19 Thread Eric Beabes
Will do, thanks! On Tue, Jan 19, 2021 at 1:39 PM Gabor Somogyi wrote: > Thanks for double checking the version. Please report back with 3.1 > version whether it works or not. > > G > > > On Tue, 19 Jan 2021, 07:41 Eric Beabes, wrote: > >> Confirmed. The cluster

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Eric Beabes
Confirmed. The cluster Admin said his team installed the latest version from Cloudera which comes with Spark 3.0.0-preview2. They are going to try to upgrade it with the Community edition Spark 3.1.0. Thanks Jungtaek for the tip. Greatly appreciate it. On Tue, Jan 19, 2021 at 8:45 AM Eric Beabes

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Eric Beabes
e.org/jira/projects/SPARK/summary> and >>> regarding the repo, I believe just commit it to your personal repo and that >>> should be it. >>> >>> Regards >>> >>> On Mon, 18 Jan 2021 at 15:46, Eric Beabes >>> wrote: >>> >

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Eric Beabes
e a jira and commit the > code into github? > It would speed things up a lot. > > G > > > On Mon, Jan 18, 2021 at 2:14 PM Eric Beabes > wrote: > >> Here's a very simple reproducer app. I've attached 3 files: >> SparkTest.scala, QueryListener.

Re: Data source v2 streaming sinks does not support Update mode

2021-01-18 Thread Eric Beabes
org.scalastyle scalastyle-maven-plugin 1.0.0 false true true false ${project.basedir}/src/main/scala ${project.basedir}/src/test/scala lib/scalastyle_config

Re: Data source v2 streaming sinks does not support Update mode

2021-01-13 Thread Eric Beabes
uild >> script. >> >> Thanks in advance! >> >> On Wed, Jan 13, 2021 at 3:46 PM Eric Beabes >> wrote: >> >>> I tried both. First tried 3.0.0. That didn't work so I tried 3.1.0. >>> >>> On Wed, Jan 13, 2021 at 11:35 AM Jungta

Re: Data source v2 streaming sinks does not support Update mode

2021-01-12 Thread Eric Beabes
because you've said you've used Spark 3.0 but spark-sql-kafka > dependency pointed to 3.1.0.) > > On Tue, Jan 12, 2021 at 11:04 PM Eric Beabes > wrote: > >> org.apache.spark.sql.streaming.StreamingQueryException: Data source v2 >> streaming sinks does not support

Re: Understanding Executors UI

2021-01-12 Thread Eric Beabes
gt; >> For example look at the details per executor (the numbers you reported >> are aggregated values), then also look at the “storage tab” for a list of >> cached RDDs with details. >> >> In case, Spark 3.0 has improved memory instrumentation and improved >> instru

Data source v2 streaming sinks does not support Update mode

2021-01-11 Thread Eric Beabes
Trying to port my Spark 2.4 based (Structured) streaming application to Spark 3.0. I compiled it using the dependency given below: org.apache.spark spark-sql-kafka-0-10_${scala.binary.version} 3.1.0 Every time I run it under Spark 3.0, I get this message: *Data source v2 streaming
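
The resolution in this thread (see the replies above) was a version mismatch between the cluster's Spark and the connector. A small sbt-style sketch of keeping the two in lock-step (version numbers are illustrative):

    // build.sbt: the Kafka connector must match the Spark version that runs the job.
    val sparkVersion = "3.1.0"
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"            % sparkVersion % "provided",
      "org.apache.spark" %% "spark-sql-kafka-0-10" % sparkVersion
    )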

Re: Understanding Executors UI

2021-01-07 Thread Eric Beabes
the system is low after 5pm PST. I would expect the "Memory used" to be lower than 3.3Tb after 5pm PST. Does Spark 3.0 do a better job of memory management? Wondering if upgrading to Spark 3.0 would improve performance? On Wed, Jan 6, 2021 at 2:29 PM Luca Canali wrote: > Hi Eric,

unsubscribe

2020-12-10 Thread Eric Richardson
unsubscribe

Re: Cannot perform operation after producer has been closed

2020-12-09 Thread Eric Beabes
, Nov 20, 2020 at 7:30 AM Gabor Somogyi wrote: > Happy that saved some time for you :) > We've invested quite an effort in the latest releases into streaming and > hope there will be less and less headaches like this. > > On Thu, Nov 19, 2020 at 5:55 PM Eric Beabes > wrote:

Re: Blacklisting in Spark Stateful Structured Streaming

2020-11-20 Thread Eric Beabes
t;stateful" SS job, > the blacklisting structure can be put into the user-defined state. > To use a 3rd-party cache should also be a good choice. > > Eric Beabes 于2020年11月11日周三 上午6:54写道: > >> Currently we’ve a “Stateful” Spark Structured Streaming job that computes >

Re: Cannot perform operation after producer has been closed

2020-11-19 Thread Eric Beabes
ough time to migrate to > Spark 3. > > > On Wed, Nov 18, 2020 at 11:12 PM Eric Beabes > wrote: > >> I must say.. *Spark has let me down in this case*. I am surprised an >> important issue like this hasn't been fixed in Spark 2.4. >> >> I am fighting a batt

Re: Cannot perform operation after producer has been closed

2020-11-18 Thread Eric Beabes
been asked to rewrite the code in Flink*. Moving to Spark 3.0 is not an easy option 'cause Cloudera 6.2 doesn't have a Spark 3.0 parcel So we can't upgrade to 3.0. So sad. Let me ask one more time. *Is there no way to fix this in Spark 2.4?* On Tue, Nov 10, 2020 at 11:33 AM Eric Bea

Blacklisting in Spark Stateful Structured Streaming

2020-11-10 Thread Eric Beabes
Currently we’ve a “Stateful” Spark Structured Streaming job that computes aggregates for each ID. I need to implement a new requirement which says that if the no. of incoming messages for a particular ID exceeds a certain value then add this ID to a blacklist & remove the state for it. Going forwar
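
One of the replies above suggests keeping the blacklisting structure in the user-defined state. A rough sketch with flatMapGroupsWithState; the case classes, threshold, and the choice to simply drop the state are illustrative assumptions, not the poster's actual code:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
    import spark.implicits._   // spark is the active SparkSession

    case class Msg(id: String, payload: String)
    case class IdState(count: Long)

    val threshold = 1000L   // illustrative cutoff

    // Count messages per ID in the state; once the count exceeds the threshold,
    // drop the state (the "blacklisting") and emit the offending ID.
    def update(id: String, msgs: Iterator[Msg], state: GroupState[IdState]): Iterator[String] = {
      val count = state.getOption.map(_.count).getOrElse(0L) + msgs.size
      if (count > threshold) {
        state.remove()
        Iterator.single(id)
      } else {
        state.update(IdState(count))
        Iterator.empty
      }
    }

    // messages is assumed to be a Dataset[Msg] built from the streaming source.
    val blacklisted = messages
      .groupByKey(_.id)
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(update)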

Re: Cannot perform operation after producer has been closed

2020-11-10 Thread Eric Beabes
ov 10, 2020 at 11:17 AM Eric Beabes wrote: > Thanks for the reply. We are on Spark 2.4. Is there no way to get this > fixed in Spark 2.4? > > On Mon, Nov 2, 2020 at 8:32 PM Jungtaek Lim > wrote: > >> Which Spark version do you use? There's a known issue on Kafka produ

Re: Cannot perform operation after producer has been closed

2020-11-10 Thread Eric Beabes
;d like to check > whether your case is bound to the known issue or not. > > https://issues.apache.org/jira/browse/SPARK-21869 > > > On Tue, Nov 3, 2020 at 1:53 AM Eric Beabes > wrote: > >> I know this is related to Kafka but it happens during the Spark >> Structured

Cannot perform operation after producer has been closed

2020-11-02 Thread Eric Beabes
I know this is related to Kafka but it happens during the Spark Structured Streaming job that's why I am asking on this mailing list. How would you debug this or get around this in Spark Structured Streaming? Any tips would be appreciated. Thanks. java.lang.IllegalStateException: Cannot perform

Debugging tools for Spark Structured Streaming

2020-10-29 Thread Eric Beabes
We're using Spark 2.4. We recently pushed to production a product that's using Spark Structured Streaming. It's working well most of the time but occasionally, when the load is high, we've noticed that there are only 10+ 'Active Tasks' even though we've provided 128 cores. Would like to debug this

States get dropped in Structured Streaming

2020-10-22 Thread Eric Beabes
We're using Stateful Structured Streaming in Spark 2.4. We are noticing that when the load on the system is heavy & LOTs of messages are coming in some of the states disappear with no error message. Any suggestions on how we can debug this? Any tips for fixing this? Thanks in advance.

Re: Submitting Spark Job thru REST API?

2020-09-14 Thread Eric Beabes
ay to upload the JAR file prior to running this? Get the Id of this file & then submit the Spark job. Kinda like how Flink does it. I realize this is an Apache Livy question so I will also ask on their mailing list. Thanks. On Thu, Sep 3, 2020 at 11:47 AM Eric Beabes wrote: > Thank you all

Re: Submitting Spark Job thru REST API?

2020-09-03 Thread Eric Beabes
Thank you all for your responses. Will try them out. On Thu, Sep 3, 2020 at 12:06 AM tianlangstudio wrote: > Hello, Eric > Maybe you can use Spark JobServer 0.10.0 > https://github.com/spark-jobserver/spark-jobserver > We used this with Spark 1.6, and it is awesome. You know >

Submitting Spark Job thru REST API?

2020-09-02 Thread Eric Beabes
Under Spark 2.4 is it possible to submit a Spark job thru REST API - just like the Flink job? Here's the use case: We need to submit a Spark Job to the EMR cluster but our security team is not allowing us to submit a job from the Master node or thru UI. They want us to create a "Docker Container"

Load distribution in Structured Streaming

2020-07-06 Thread Eric Beabes
In my structured streaming job I've noticed that a LOT of data keeps going to one executor whereas other executors don't process that much data. As a result, tasks on that executor take a lot of time to complete. In other words, the distribution is skewed. I believe in Structured streaming the Par

Failure Threshold in Spark Structured Streaming?

2020-07-02 Thread Eric Beabes
Currently my job fails even on a single failure. In other words, even if one incoming message is malformed the job fails. I believe there's a property that allows us to set an acceptable number of failures. I Googled but couldn't find the answer. Can someone please help? Thanks.

Question about 'maxOffsetsPerTrigger'

2020-06-30 Thread Eric Beabes
While running my Spark (Stateful) Structured Streaming job I am setting 'maxOffsetsPerTrigger' value to 10 Million. I've noticed that messages are processed faster if I use a large value for this property. What I am also noticing is that until the batch is completely processed, no messages are get
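
For reference, a minimal sketch of where the option in question is set (broker and topic names are placeholders):

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", 10000000L)   // cap on records pulled per micro-batch
      .load()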

Re: Spark Structured Streaming: “earliest” as “startingOffsets” is not working

2020-06-26 Thread Eric Beabes
My apologies... After I set the 'maxOffsetsPerTrigger' to a value such as '20' it started working. Hopefully this will help someone. Thanks. On Fri, Jun 26, 2020 at 2:12 PM Something Something < mailinglist...@gmail.com> wrote: > My Spark Structured Streaming job works fine when I set "start

Unsubscribe

2019-08-20 Thread ERIC JOEL BLANCO-HERMIDA SANZ
Unsubscribe This message and its attachments are addressed exclusively to the named recipient; they may contain privileged or confidential information and are for the exclusive use of the intended person or entity. If you are not the indicated recipient, you are hereby notified that

[SPARK-23153][K8s] Would be available in Spark 2.X ?

2019-06-25 Thread ERIC JOEL BLANCO-HERMIDA SANZ
Hi, I’m using Spark 2.4.3 on K8s and would like to do what’s solved in [SPARK-23153], that is, be able to download dependencies through --packages so that the driver can access them. Right now, in Spark 2.4.3, after the spark-submit and download of dependencies, the driver cannot access them.

Re: [Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Eric Wang
/cloud.google.com/sol >> utions/spark-on-kubernetes-engine which could be relevant. >> >> On Mon, Apr 30, 2018 at 7:51 PM, Eric Wang >> wrote: >> >>> Hello all, >>> >>> I've been trying to spark-submit a job to the Google Kubernetes E

[Spark on Google Kubernetes Engine] Properties File Error

2018-04-30 Thread Eric Wang
ttps://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html Thanks, Eric

Cascading Spark Structured streams

2017-12-28 Thread Eric Dain
I need to write a Spark Structured Streaming pipeline that involves multiple aggregations, splitting data into multiple sub-pipes and union them. Also it need to have stateful aggregation with timeout. Spark Structured Streaming support all of the required functionality but not as one stream. I di

Ingesting Large csv File to relational database

2017-01-25 Thread Eric Dain
ferred and performant way to do that using Apache Spark ? Best, Eric

Why is Spark getting Kafka data out from port 2181 ?

2016-09-10 Thread Eric Ho
r specific truststore in my Spark config ? Do I just give -D flags via JAVA_OPTS ? Thx -- -eric ho

how to pass trustStore path into pyspark ?

2016-09-02 Thread Eric Ho
I'm trying to pass a trustStore pathname into pyspark. What env variable and/or config file or script I need to change to do this ? I've tried setting JAVA_OPTS env var but to no avail... any pointer much appreciated... thx -- -eric ho
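
One common way to hand a truststore to both the driver and executor JVMs, offered as an assumption about what is needed here rather than a verified fix; the path is a placeholder, and the same keys can be passed with --conf on spark-submit:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.driver.extraJavaOptions",
           "-Djavax.net.ssl.trustStore=/path/to/truststore.jks")
      .set("spark.executor.extraJavaOptions",
           "-Djavax.net.ssl.trustStore=/path/to/truststore.jks")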

Re: how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Eric Ho
I'm interested in what I should put into the trustStore file, not just for Spark but also for the Kafka and Cassandra sides. The way I generated self-signed certs for the Kafka and Cassandra sides is slightly different... On Thu, Sep 1, 2016 at 1:09 AM, Eric Ho wrote: > A working example

how should I compose keyStore and trustStore if Spark needs to talk to Kafka & Cassandra ?

2016-09-01 Thread Eric Ho
A working example would be great... Thx -- -eric ho

KeyManager exception in Spark 1.6.2

2016-08-31 Thread Eric Ho
)* *at org.apache.spark.deploy.master.Master.main(Master.scala)* ===== -- -eric ho

Spark to Kafka communication encrypted ?

2016-08-31 Thread Eric Ho
I can't find in Spark 1.6.2's docs how to turn on encryption for Spark-to-Kafka communication... I think the Spark docs only tell you how to turn on encryption for inter-Spark-node communications. Am I wrong? Thanks. -- -eric ho

Do we still need to use Kryo serializer in Spark 1.6.2 ?

2016-08-22 Thread Eric Ho
I heard that Kryo will get phased out at some point but not sure which Spark release. I'm using PySpark, does anyone has any docs on how to call / use Kryo Serializer in PySpark ? Thanks. -- -eric ho

Re: How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
u're asking about. > > I would personally use something like CoGroup or Join between the two > RDDs. if index matters, you can use ZipWithIndex on both before you join > and then see which indexes match up. > > On Mon, Aug 15, 2016 at 1:15 PM Eric Ho wrote: > >>

How to do nested for-each loops across RDDs ?

2016-08-15 Thread Eric Ho
this RDD would contain elements in array B as well as array A. Same argument for RDD(B). Any pointers much appreciated. Thanks. -- -eric ho

how to do nested loops over 2 arrays but use Two RDDs instead ?

2016-08-15 Thread Eric Ho
I couldn't find any RDD functions that would do this for me efficiently. I don't really want elements of RDD(A) and RDD(B) flying all over the network piecemeal... Thanks. -- -eric ho

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
; Jenkins jobs have been running against Scala 2.11: > > [INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ > java8-tests_2.11 --- > > > FYI > > > On Mon, May 16, 2016 at 2:45 PM, Eric Richardson > wrote: > >> On Thu, May 12, 2016 at

Re: sbt for Spark build with Scala 2.11

2016-05-16 Thread Eric Richardson
On Thu, May 12, 2016 at 9:23 PM, Luciano Resende wrote: > Spark has moved to build using Scala 2.11 by default in master/trunk. > Does this mean that the pre-built binaries for download will also move to 2.11 as well? > > > As for the 2.0.0-SNAPSHOT, it is actually the version of master/trunk

Re: createDirectStream with offsets

2016-05-07 Thread Eric Friedman
the types you're passing in don't > match. For instance, you're passing in a message handler that returns > a tuple, but the rdd return type you're specifying (the 5th type > argument) is just String. > > On Fri, May 6, 2016 at 9:49 AM, Eric Friedman >

Re: createDirectStream with offsets

2016-05-06 Thread Eric Friedman
.1' compile 'org.apache.kafka:kafka_2.10:0.8.2.1' compile 'com.yammer.metrics:metrics-core:2.2.0' On Fri, May 6, 2016 at 7:47 AM, Eric Friedman wrote: > Hello, > > I've been using createDirectStream with Kafka and now need to switch to > the version of tha

createDirectStream with offsets

2016-05-06 Thread Eric Friedman
Hello, I've been using createDirectStream with Kafka and now need to switch to the version of that API that lets me supply offsets for my topics. I'm unable to get this to compile for some reason, even if I lift the very same usage from the Spark test suite. I'm calling it like this: val to
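
A hedged sketch of the offset-supplying overload being discussed (Spark 1.6 / Kafka 0.8 integration; ssc and kafkaParams are assumed to already exist). The point raised in the reply quoted above is that the fifth type parameter must match what the message handler returns:

    import kafka.common.TopicAndPartition
    import kafka.message.MessageAndMetadata
    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.KafkaUtils

    // Starting offsets per topic-partition (values here are placeholders).
    val fromOffsets: Map[TopicAndPartition, Long] =
      Map(TopicAndPartition("my-topic", 0) -> 0L)

    // The fifth type parameter, (String, String), matches the handler's return type.
    val stream = KafkaUtils.createDirectStream[
        String, String, StringDecoder, StringDecoder, (String, String)](
      ssc,
      kafkaParams,
      fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))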

Hadoop Context

2016-04-28 Thread Eric Friedman
Hello, Where in the Spark APIs can I get access to the Hadoop Context instance? I am trying to implement the Spark equivalent of this public void reduce(Text key, Iterable values, Context context) throws IOException, InterruptedException { if (record == null) { throw

Fwd: Connection failure followed by bad shuffle files during shuffle

2016-03-15 Thread Eric Martin
ase so the bug can be more easily investigated? Best, Eric Martin

Re: RE: Spark checkpoint problem

2015-11-26 Thread eric wong
I don't think it is a deliberate design. So you may need to perform an action on the RDD if you want it to be explicitly checkpointed. 2015-11-26 13:23 GMT+08:00 wyphao.2007 : > Spark 1.5.2. > > On 2015-11-26 13:19:39, "张志强(旺轩)" wrote: > > What’s your spark version? > > *From:* wyphao.

Re: Cassandra row count grouped by multiple columns

2015-09-11 Thread Eric Walker
ot;C2"), Row("A1", "B1", "C1") )) val schema = StructType(Seq("a", "b", "c").map(c => StructField(c, StringType))) val df = sqlContext.createDataFrame(rdd, schema) df.registerTempTable("rows") sqlContext.sql("select a,

Re: spark.shuffle.spill=false ignored?

2015-09-09 Thread Eric Walker
y and spark.shuffle.memoryFraction had no observable effect. It is possible that the ignoring of the spark.shuffle.spill setting was just a manifestation of a larger issue going back to a misconfiguration. Eric On Wed, Sep 9, 2015 at 4:48 PM, Richard Marscher wrote: > Hi Eric, > > I just wanted to do a sanity

spark.shuffle.spill=false ignored?

2015-09-03 Thread Eric Walker
plenty of space (perhaps after the fact, when temporary files have been cleaned up). Has anyone run into something like this before? I would be happy to see OOM errors, because that would be consistent with one understanding of what might be going on, but I haven't yet. Eric [1] https://www

Re: cached data between jobs

2015-09-02 Thread Eric Walker
ondered whether there had been some kind of shifting in the data.) Eric On Tue, Sep 1, 2015 at 9:54 PM, Jeff Zhang wrote: > Hi Eric, > > If the 2 jobs share the same parent stages. these stages can be skipped > for the second job. > > Here's one simple example: > >

cached data between jobs

2015-09-01 Thread Eric Walker
ng in order to get a better sense of the worst-case scenario? (It's also possible that I've simply changed something that made things faster.) Eric

Re: bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
nd from its response to changes I subsequently made that the actual code that was running was the code doing the HBase lookups. I suspect the actual shuffle, once it occurred, required on the same order of network IO as the upload to Elasticsearch that followed. Eric On Mon, Aug 31, 2015 at

bulk upload to Elasticsearch and shuffle behavior

2015-08-31 Thread Eric Walker
Does anyone know what might be going on here, and what I might be able to do to get rid of the last `repartition` call before the upload to ES? Eric

Re: build spark 1.4.1 with JDK 1.6

2015-08-25 Thread Eric Friedman
not visible to the > Maven process. Or maybe you have JRE 7 installed but not JDK 7 and > it's somehow still finding the Java 6 javac. > > On Tue, Aug 25, 2015 at 3:45 AM, Eric Friedman > wrote: > > I'm trying to build Spark 1.4 with Java 7 and despite having tha

Re: build spark 1.4.1 with JDK 1.6

2015-08-24 Thread Eric Friedman
I'm trying to build Spark 1.4 with Java 7 and despite having that as my JAVA_HOME, I get [INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ spark-launcher_2.10 --- [INFO] Using zinc server for incremental compilation [info] Compiling 8 Java sources to /Users/eric/spark/

registering an empty RDD as a temp table in a PySpark SQL context

2015-08-17 Thread Eric Walker
ecial case logic. Eric

adding a custom Scala RDD for use in PySpark

2015-08-11 Thread Eric Walker
is private. This suggests to me that I'm doing something wrong, although I got it to work with sufficient hackery. What do people recommend for a general approach in getting PySpark RDDs from HBase prefix scans? I hope I have not missed something obvious. Eric

Boosting spark.yarn.executor.memoryOverhead

2015-08-11 Thread Eric Bless
Previously I was getting a failure which included the message Container killed by YARN for exceeding memory limits. 2.1 GB of 2 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead. So I attempted the following - spark-submit --jars examples.jar latest_msmtdt_by
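
For reference, a sketch of acting on that message; the value is an example in megabytes and would need tuning for the job (the same setting can also be passed as --conf spark.yarn.executor.memoryOverhead=1024 on spark-submit):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.yarn.executor.memoryOverhead", "1024")   // MB of off-heap headroom per executor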

Re: Problems getting expected results from hbase_inputformat.py

2015-08-10 Thread Eric Bless
row3'}) Just to be clear, you refer to "Spark update these two scripts recently.". What two scripts were you referencing? On Friday, August 7, 2015 7:59 PM, gen tang wrote: Hi, In fact, Pyspark use org.apache.spark.examples.pythonconverters(./examples/sr

Problems getting expected results from hbase_inputformat.py

2015-08-07 Thread Eric Bless
I’m having some difficulty getting the desired results from the Spark Python example hbase_inputformat.py. I’m running with CDH 5.4, HBase version 1.0.0, Spark v 1.3.0, using Python version 2.6.6. I followed the example to create a test HBase table. Here’s the data from the table I created – hbase(m

projection optimization?

2015-07-28 Thread Eric Friedman
If I have a Hive table with six columns and create a DataFrame (Spark 1.4.1) using a sqlContext.sql("select * from ...") query, the resulting physical plan shown by explain reflects the goal of returning all six columns. If I then call select("one_column") on that first DataFrame, the resulting Da
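
A quick way to inspect what actually gets pruned, using a hypothetical table name (the column name is taken from the question):

    val df = sqlContext.sql("select * from some_hive_table")   // hypothetical table
    df.select("one_column").explain(true)   // look for the pruned column list in the physical plan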

Re: Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
ices=login,sshd,sudo. Thanks, -- Eric On Wed, Jul 8, 2015 at 2:27 PM, Eric Pederson wrote: > All: > > I recently ran into a scenario where spark-shell could communicate with > Hive but another application of mine (Spark Notebook) could not. When I > tried to get a reference t

Communication between driver, cluster and HiveServer

2015-07-08 Thread Eric Pederson
ml it does. How does the communication between the driver and Hive work? And is spark-shell somehow special in this regard? Thanks, -- Eric

Re: Error when connecting to Spark SQL via Hive JDBC driver

2015-07-07 Thread Eric Pederson
Hi Ratio - You need more than just hive-jdbc jar. Here are all of the jars that I found were needed. I got this list from https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-RunningtheJDBCSampleCode plus trial and error. [image: Inline image 1] -- Eric On

KMeans questions

2015-07-01 Thread Eric Friedman
In preparing a DataFrame (spark 1.4) to use with MLlib's kmeans.train method, is there a cleaner way to create the Vectors than this? data.map{r => Vectors.dense(r.getDouble(0), r.getDouble(3), r.getDouble(4), r.getDouble(5), r.getDouble(6))} Second, once I train the model and call predict on my
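
One marginally tidier variant of the mapping in the question, driven by a list of column positions (same behavior; assumes data is a DataFrame and the MLlib Vectors class):

    import org.apache.spark.mllib.linalg.Vectors

    val cols = Seq(0, 3, 4, 5, 6)
    val vectors = data.rdd.map(r => Vectors.dense(cols.map(r.getDouble).toArray))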

Re: Subsecond queries possible?

2015-07-01 Thread Eric Pederson
'm really comparing apples and oranges right now. But it's an interesting experiment nonetheless. -- Eric On Wed, Jul 1, 2015 at 12:47 PM, Debasish Das wrote: > If you take bitmap indices out of sybase then I am guessing spark sql will > be at par with sybase ? > > On that

Re: Subsecond queries possible?

2015-06-30 Thread Eric Pederson
interested to see how far it can be pushed. Thanks for your help! -- Eric On Tue, Jun 30, 2015 at 5:28 PM, Debasish Das wrote: > I got good runtime improvement from hive partitioninp, caching the dataset > and increasing the cores through repartition...I think for your case > gen

Subsecond queries possible?

2015-06-30 Thread Eric Pederson
e impact...documentation > says Spark SQL should read partitioned table... > > Could you please share your results with partitioned tables ? > > On Tue, Jun 30, 2015 at 5:24 AM, Eric Pederson wrote: > >> Hi Deb - >> >> One other consideration is that the filter

Re: when cached RDD will unpersist its data

2015-06-23 Thread eric wong
In the case that memory cannot hold all of the cached RDD, the BlockManager will evict some older blocks to make room for new RDD blocks. Hope that helps. 2015-06-24 13:22 GMT+08:00 bit1...@163.com : > I am kind of confused about when cached RDD will unpersist its data. I > know we can explicitl

SPARK-8566

2015-06-23 Thread Eric Friedman
I logged this Jira this morning: https://issues.apache.org/jira/browse/SPARK-8566 I'm curious if any of the cognoscenti can advise as to a likely cause of the problem?
