Re: EDI (Electronic Data Interchange) parser on Spark

2018-03-13 Thread Darin McBeath
XQuery fragment (the enclosing element constructors were lost in extraction): {$cid} {$pii} {$content-type} {$srctitle} {$document-type} {$document-subtype} {$publication-date} {$article-title} {$issn} {$isbn} {$lang} {$tables} return xml-to-json($retval) Darin. On Tuesday, March 13, 2018,

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-20 Thread darin
This issue on Stack Overflow may help: https://stackoverflow.com/questions/42641573/why-does-memory-usage-of-spark-worker-increases-with-time/42642233#42642233

Re: spark streaming executors memory increasing and executor killed by yarn

2017-03-17 Thread darin
I added this code in the foreachRDD block: ``` rdd.persist(StorageLevel.MEMORY_AND_DISK) ``` The exception no longer occurs, but many dead executors show up in the Spark Streaming UI: ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0 f
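
For context, a minimal sketch of where that persist call sits inside foreachRDD; the socket source, batch interval, and the unpersist at the end of each batch are illustrative assumptions, not part of the original program:

```
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object PersistInForeachRDD {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("persist-in-foreachRDD")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // Hypothetical source; any DStream would be handled the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      // Persist once, reuse the data for several actions in this batch,
      // then release the blocks so executor memory does not keep growing.
      rdd.persist(StorageLevel.MEMORY_AND_DISK)
      val total  = rdd.count()
      val errors = rdd.filter(_.contains("ERROR")).count()
      println(s"batch total=$total errors=$errors")
      rdd.unpersist()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```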

spark streaming executors memory increasing and executor killed by yarn

2017-03-16 Thread darin
Hi, I got this exception after the streaming program had been running for some hours. ``` User class threw exception: org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 1194.0 failed 4 times, most recent failure: Lost task 21.3 in stage 1194.0 (TID 2475, 2.dev3, executor 66): ExecutorL

Re: [SparkSQL] too many open files although ulimit set to 1048576

2017-03-13 Thread darin
I think your settings are not taking effect. Try adding `ulimit -n 10240` in spark-env.sh.

Dataset doesn't have partitioner after a repartition on one of the columns

2016-09-20 Thread McBeath, Darin W (ELS-STL)
d.partitioner I get Option[org.apache.spark.Partitioner] = None I would have thought it would be HashPartitioner. Does anyone know why this would be None and not HashPartitioner? Thanks. Darin.
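
A small sketch that reproduces what is being described, assuming Spark 2.x and a hypothetical column name; Dataset-level repartitioning by a column is planned as a Catalyst hash partitioning, which is not surfaced as an RDD Partitioner, so rdd.partitioner reports None:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ds-partitioner").getOrCreate()
import spark.implicits._

case class Entity(customerId: String, numDocs: Long)   // hypothetical schema
val ds = Seq(Entity("a", 1L), Entity("b", 2L)).toDS()

val repartitioned = ds.repartition($"customerId")

// The shuffle shows up as hashpartitioning(...) in the physical plan,
// but the resulting RDD carries no Partitioner object.
println(repartitioned.rdd.partitioner)   // None
repartitioned.explain()                  // Exchange hashpartitioning(customerId, ...)
```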

How to find the partitioner for a Dataset

2016-09-07 Thread Darin McBeath
ld be appreciated. Thanks. Darin.

Datasets and Partitioners

2016-09-06 Thread Darin McBeath
iated. Thanks. Darin.

Dataset Filter performance - trying to understand

2016-09-01 Thread Darin McBeath
uld of course increase the partitions (and the size of my cluster), but I also want to clarify my understanding of whole-stage code generation. Any thoughts/suggestions would be appreciated. Also, if anyone has found good resources that further explain the details

Re: Best way to read XML data from RDD

2016-08-22 Thread Darin McBeath
that it returns a string.  So, you have to be a little creative when returning multiple values (such as delimiting the values with a special character and then splitting on this delimiter).   Darin. From: Diwakar Dhanuskodi To: Darin McBeath ; Hyukjin Kwon ; Jörn Franke Cc: Felix

Re: Best way to read XML data from RDD

2016-08-21 Thread Darin McBeath
xslt to transform, extract, or filter. Like mentioned below, you want to initialize the parser in a mapPartitions call (one of the examples shows this). Hope this is helpful. Darin. From: Hyukjin Kwon To: Jörn Franke Cc: Diwakar Dhanuskodi ; Felix Cheung

RDD vs Dataset performance

2016-07-28 Thread Darin McBeath
) be faster in many use cases (such as the one I'm using above). Am I doing something wrong or is this to be expected? Thanks. Darin.

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
tanding for how repartition should work or if this is a bug. Thanks Jacek for starting to dig into this. Darin. - Original Message ----- From: Darin McBeath To: Jacek Laskowski Cc: user Sent: Friday, March 11, 2016 1:57 PM Subject: Re: spark 1.6 foreachPartition only appears to be running o

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
fo("SimpleStorageServiceInit call arg1: "+ arg1); log.info("SimpleStorageServiceInit call arg2:"+ arg2); log.info("SimpleStorageServiceInit call arg3: "+ arg3); SimpleStorageService.init(this.arg1, this.arg2, this.arg3); } } From: Jacek L

Re: spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
st the job dies. Darin. From: Jacek Laskowski To: Darin McBeath Cc: user Sent: Friday, March 11, 2016 1:24 PM Subject: Re: spark 1.6 foreachPartition only appears to be running on one executor Hi, How do you check which executor is used? Can you include a

spark 1.6 foreachPartition only appears to be running on one executor

2016-03-11 Thread Darin McBeath
sing spark-submit. Thanks. Darin.

Best practice for retrieving over 1 million files from S3

2016-01-13 Thread Darin McBeath
wholeTextFiles(since some of my files can have embedded newlines). But, I wonder how either of these would behave if I passed literally a million (or more) 'filenames'. Before I spend time exploring, I wanted to seek some input. Any th

Re: Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
Thanks. I already set the following in spark-defaults.conf so I don't think that is going to fix my problem. spark.eventLog.dir file:///root/spark/applicationHistory spark.eventLog.enabled true I suspect my problem must be something else. Darin. From

Spark 1.6 and Application History not working correctly

2016-01-13 Thread Darin McBeath
le to view the application history for both jobs. Has anyone else noticed this issue? Any suggestions? Thanks. Darin.

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-04 Thread Darin McBeath
ytes())); } } Lastly, you will need to include the saxon9he jar file (open source version). This would work for XPath, XQuery, and XSLT. Hope this helps. When I get a chance, I will update the spark-xml-utils github site with details on the new getInstance function and som
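
The `ytes()))` fragment above is the tail of that idea; a minimal sketch of a SAX EntityResolver that resolves every DTD reference to an empty document could look like the following (this is the generic org.xml.sax interface, not the spark-xml-utils API itself):

```
import java.io.ByteArrayInputStream
import org.xml.sax.{EntityResolver, InputSource}

// Hands the parser an empty stream for any external entity (such as a DTD
// referenced from a DOCTYPE), so nothing is fetched over the network and
// no validation against the DTD takes place.
class EmptyDtdResolver extends EntityResolver {
  override def resolveEntity(publicId: String, systemId: String): InputSource =
    new InputSource(new ByteArrayInputStream("".getBytes("UTF-8")))
}
```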

Re: Turning off DTD Validation using XML Utils package - Spark

2015-12-01 Thread Darin McBeath
D that is completely empty) or you could have it find 'local' versions (on the workers or in S3 and then cache them locally for performance). I will post an update when the code has been adjusted. Darin. - Original Message - From: Shivalik To: user@spark.apache.org Sent: T

Re: Reading xml in java using spark

2015-09-01 Thread Darin McBeath
ava) val proc = XPathProcessor.getInstance(xpath,namespaces) recsIter.map(rec => proc.evaluateString(rec._2).toInt) }).sum There is more documentation on the spark-xml-utils github site. Let me know if the documentation is not clear or if you have any
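
Filling in the context around that fragment, the mapPartitions pattern might look roughly like this; the import path, input location, XPath expression, and namespace map are assumptions for illustration, while the getInstance/evaluateString calls come from the fragment itself:

```
import java.util.HashMap
import com.elsevier.spark_xml_utils.xpath.XPathProcessor   // assumed package name

val xmlKeyPair = sc.sequenceFile[String, String]("s3n://bucket/xml/part*")   // placeholder path

val total = xmlKeyPair.mapPartitions(recsIter => {
  // One processor per partition, so the expensive initialization happens
  // once per task rather than once per record.
  val namespaces = new HashMap[String, String]()
  namespaces.put("x", "http://example.com/ns")        // hypothetical namespace
  val xpath = "count(/x:document/x:section)"          // hypothetical expression
  val proc  = XPathProcessor.getInstance(xpath, namespaces)
  recsIter.map(rec => proc.evaluateString(rec._2).toInt)
}).sum
```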

Please add the Cincinnati spark meetup to the list of meet ups

2015-07-07 Thread Darin McBeath
 http://www.meetup.com/Cincinnati-Apache-Spark-Meetup/ Thanks. Darin.

Running into several problems with Data Frames

2015-04-17 Thread Darin McBeath
tyId: Long, EntityType: String, CustomerId: String, EntityURI: String, NumDocs: Long) val entities = sc.textFile("s3n://darin/Entities.csv") val entitiesArr = entities.map(v => v.split('|')) val dfEntity = entitiesArr.map(arr => Entity(arr(0).toLong, arr(1).toLong, arr(2),
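
A sketch of how that split-and-map-to-a-case-class pattern typically finishes under the Spark 1.3-era DataFrame API; the field names past the truncation and the file path are placeholders, not the original definitions:

```
import org.apache.spark.sql.SQLContext

case class Entity(id: Long, entityId: Long, entityType: String,
                  customerId: String, entityUri: String, numDocs: Long)   // placeholder names

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val entities = sc.textFile("s3n://bucket/Entities.csv")   // placeholder path
val dfEntity = entities
  .map(_.split('|'))
  .map(arr => Entity(arr(0).toLong, arr(1).toLong, arr(2),
                     arr(3), arr(4), arr(5).toLong))
  .toDF()

dfEntity.registerTempTable("entity")   // Spark 1.3 API; later versions use createOrReplaceTempView
```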

repartitionAndSortWithinPartitions and mapPartitions and sort order

2015-03-12 Thread Darin McBeath
nted to confirm. Thanks. Darin.
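
For reference, a small sketch of the combination being asked about; as long as no other shuffle sits between the two calls, mapPartitions sees each partition's records in the key order produced by repartitionAndSortWithinPartitions (the data and partition count here are illustrative):

```
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// Shuffles by key and sorts records by key inside each partition in one pass.
val sortedWithin = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))

// The iterator handed to mapPartitions is already in key order for that partition.
val firstKeyPerPartition = sortedWithin.mapPartitions(iter =>
  if (iter.hasNext) Iterator(iter.next()._1) else Iterator.empty
).collect()
```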

Question about the spark assembly deployed to the cluster with the ec2 scripts

2015-03-05 Thread Darin McBeath
s for hadoop2.0.0 (and I think Cloudera). Is there a way that I can force the install of the same assembly to the cluster that comes with the 1.2 download of spark? Thanks. Darin.

Re: Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
Thanks for your quick reply. Yes, that would be fine. I would rather wait/use the optimal approach as opposed to hacking some one-off solution. Darin. From: Kostas Sakellis To: Darin McBeath Cc: User Sent: Friday, February 27, 2015 12:19 PM Subject: Re

Question about Spark best practice when counting records.

2015-02-27 Thread Darin McBeath
'good' records as well? I realize that if I lose a partition I might over-count, but perhaps that is an acceptable trade-off. I'm guessing that others have run into this before, so I would like to learn from the exper
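
One straightforward way to get both numbers without the over-counting concern is to cache the data and take two ordinary counts; unlike an accumulator incremented inside a transformation, filter(...).count() is not inflated when a task is retried or a partition is recomputed. A sketch, with a hypothetical validation rule and input path:

```
def isGood(line: String): Boolean = !line.contains("ERROR")   // placeholder rule

val records   = sc.textFile("s3n://bucket/input/*").cache()   // placeholder path
val goodCount = records.filter(isGood).count()
val badCount  = records.filter(r => !isGood(r)).count()
println(s"good=$goodCount bad=$badCount total=${goodCount + badCount}")
```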

job keeps failing with org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1

2015-02-25 Thread Darin McBeath
ne has any tips for what I should look into it would be appreciated. Thanks. Darin.

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
Just to close the loop in case anyone runs into the same problem I had. By setting --hadoop-major-version=2 when using the ec2 scripts, everything worked fine. Darin. - Original Message - From: Darin McBeath To: Mingyu Kim ; Aaron Davidson Cc: "user@spark.apache.org" Se

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
I'll try it and post a response. - Original Message - From: Mingyu Kim To: Darin McBeath ; Aaron Davidson Cc: "user@spark.apache.org" Sent: Monday, February 23, 2015 3:06 PM Subject: Re: Which OutputCommitter to use for S3? Cool, we will start from there. Thanks Aaron

Re: Which OutputCommitter to use for S3?

2015-02-23 Thread Darin McBeath
156) In my class, JobContext is an interface of type org.apache.hadoop.mapred.JobContext. Is there something obvious that I might be doing wrong (or messed up in the translation from Scala to Java) or something I should look into? I'm using Spark 1.2 with hadoop 2.4. Thanks. Darin.
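
For readers following along, a sketch of the "direct" committer idea discussed in this thread, written against the old org.apache.hadoop.mapred API mentioned above; this is the general shape, not the exact class being translated:

```
import org.apache.hadoop.mapred.{JobContext, OutputCommitter, TaskAttemptContext}

// Tasks write straight to the final output location, so there is nothing to
// move or rename at commit time; that is the usual motivation for a committer
// like this when writing to S3.
class DirectOutputCommitter extends OutputCommitter {
  override def setupJob(jobContext: JobContext): Unit = {}
  override def setupTask(taskContext: TaskAttemptContext): Unit = {}
  override def needsTaskCommit(taskContext: TaskAttemptContext): Boolean = false
  override def commitTask(taskContext: TaskAttemptContext): Unit = {}
  override def abortTask(taskContext: TaskAttemptContext): Unit = {}
}
```

It would typically be registered through the Hadoop JobConf (setOutputCommitter) or the mapred.output.committer.class property, assuming the old-API output path is in use.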

Incorrect number of records after left outer join (I think)

2015-02-19 Thread Darin McBeath
Spark 1.2 on a stand-alone cluster on ec2. To get the counts for the records, I'm using the .count() for the RDD. Thanks. Darin.

Re: How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
Thanks Imran. That's exactly what I needed to know. Darin. From: Imran Rashid To: Darin McBeath Cc: User Sent: Tuesday, February 17, 2015 8:35 PM Subject: Re: How do you get the partitioner for an RDD in Java? a JavaRDD is just a wrapper aro

How do you get the partitioner for an RDD in Java?

2015-02-17 Thread Darin McBeath
hod in Spark 1.2 or a way of getting this information. I'm curious if anyone has any suggestions for how I might go about finding how an RDD is partitioned in a Java program. Thanks. Darin.

Re: MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
would really want to do in the first place. Thanks again for your insights. Darin. From: Imran Rashid To: Darin McBeath Cc: User Sent: Tuesday, February 17, 2015 3:29 PM Subject: Re: MapValues and Shuffle Reads Hi Darin, When you say you "see 400GB

MapValues and Shuffle Reads

2015-02-17 Thread Darin McBeath
of baseline records: " + baselinePairRDD.count()); Both hsfBaselinePairRDD and baselinePairRDD have 1024 partitions. Any insights would be appreciated. I'm using Spark 1.2.0 in a stand-alone cluster. Darin.
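
Since the thread is about where the shuffle read comes from, a small sketch of the relevant distinction: mapValues keeps the parent's partitioner (so no shuffle is needed for it), whereas a plain map over the pairs drops it (data and partition count are illustrative):

```
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(1024))

val viaMapValues = pairs.mapValues(_ + 1)
println(viaMapValues.partitioner)   // Some(HashPartitioner): partitioning preserved, no shuffle needed

val viaMap = pairs.map { case (k, v) => (k, v + 1) }
println(viaMap.partitioner)         // None: a later key-based operation would have to shuffle
```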

Re: Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
ased from 4 to 16 .set("spark.task.maxFailures","64") // Didn't really matter as I had no failures in this run .set("spark.storage.blockManagerSlaveTimeoutMs","30"); From: Sven Krasser To: Darin McBeath Cc: User Sent:

Problems saving a large RDD (1 TB) to S3 as a sequence file

2015-01-23 Thread Darin McBeath
01/23 18:59:32 4.5 min 0.3 s 323.4 MB If anyone has some suggestions, please let me know. I've tried playing around with various configuration options, but I've found nothing yet that will fix the underlying issue. Thanks. Darin.

Confused about shuffle read and shuffle write

2015-01-21 Thread Darin McBeath
ere is even any 'shuffle read' when constructing the baselinePairRDD. If anyone could shed some light on this it would be appreciated. Thanks. Darin.

Confused about shuffle read and shuffle write

2015-01-20 Thread Darin McBeath
ere is even any 'shuffle read' when constructing the baselinePairRDD.  If anyone could shed some light on this it would be appreciated. Thanks. Darin.

Re: Please help me get started on Apache Spark

2014-11-20 Thread Darin McBeath
Take a look at the O'Reilly Learning Spark (Early Release) book.  I've found this very useful. Darin. From: Saurabh Agrawal To: "user@spark.apache.org" Sent: Thursday, November 20, 2014 9:04 AM Subject: Please help me get started on Apache Spark   Friends,

Confused why I'm losing workers/executors when writing a large file to S3

2014-11-13 Thread Darin McBeath
ave any thoughts/insights, it would be appreciated.  Thanks. Darin. Here is where I'm setting the 'timeout' in my spark job. SparkConf conf = new SparkConf().setAppName("SparkSync Application").set("spark.serializer",  "org.apache

ec2 script and SPARK_LOCAL_DIRS not created

2014-11-12 Thread Darin McBeath
n the workers, the directories for /mnt/spark and /mnt2/spark do not exist. Am I missing something? Has anyone else noticed this? A colleague started a cluster (using the ec2 scripts), but for m3.xlarge machines, and both /mnt/spark and /mnt2/spark directories were created. Thanks. Darin.

What should be the number of partitions after a union and a subtractByKey

2014-11-11 Thread Darin McBeath
Assume the following, where updatePairRDD and deletePairRDD are both HashPartitioned. Before the union, each one of these has 512 partitions. The newly created updateDeletePairRDD has 1024 partitions. Is this the general/expected behavior for a union (the number of partitions to double)? J
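
A sketch of the situation for reference; a plain union simply concatenates the parents' partitions, which is where 512 + 512 = 1024 comes from. Later Spark releases can notice that both parents have an equal partitioner and build a partitioner-aware union that keeps 512 partitions instead, so the exact outcome depends on the version:

```
import org.apache.spark.HashPartitioner

val updatePairRDD = sc.parallelize(Seq(("k1", "u"))).partitionBy(new HashPartitioner(512))
val deletePairRDD = sc.parallelize(Seq(("k2", "d"))).partitionBy(new HashPartitioner(512))

val updateDeletePairRDD = updatePairRDD.union(deletePairRDD)
println(updateDeletePairRDD.partitions.length)   // 1024 with a plain UnionRDD, as described above
println(updateDeletePairRDD.partitioner)         // None in that case; the hash partitioning is not carried over
```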

Question about RDD Union and SubtractByKey

2014-11-10 Thread Darin McBeath
s an RDD (part) is being re-calculated and is resulting in this error.  But, based on other logging statements throughout my application, I don't believe this is the case. Thanks. Darin. 14/11/10 22:35:27 INFO scheduler.DAGScheduler: Failed to run count at SparkSyn

Cincinnati, OH Meetup for Apache Spark

2014-11-03 Thread Darin McBeath
Let me know if you  are interested in participating in a meet up in Cincinnati, OH to discuss Apache Spark. We currently have 4-5 different companies expressing interest but would like a few more. Darin.

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
OK. After reading some documentation, it would appear the issue is the default number of partitions for a join (200). After doing something like the following, I was able to change the value. From: Darin McBeath To: User Sent: Wednesday, October 29, 2014 1:55 PM Subject: Spark SQL

Re: Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
Sorry, hit the send key a bit too early. Anyway, this is the code I set. sqlContext.sql("set spark.sql.shuffle.partitions=10"); From: Darin McBeath To: Darin McBeath ; User Sent: Wednesday, October 29, 2014 2:47 PM Subject: Re: Spark SQL and confused about number of partit

Spark SQL and confused about number of partitions/tasks to do a simple join.

2014-10-29 Thread Darin McBeath
in local mode (localhost[1]). Any insight would be appreciated. Thanks. Darin.

XML Utilities for Apache Spark

2014-10-29 Thread Darin McBeath
-shell as well as from a Java application.  Feel free to use, contribute, and/or let us know how this library can be improved.  Let me know if you have any questions. Darin.

what's the best way to initialize an executor?

2014-10-23 Thread Darin McBeath
be called more than once with XPathProcessor.init.  I have code in place to make sure this is not an issue.  But, I was wondering if there is a better way to accomplish something like this. Thanks. Darin.
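
One common answer to this is to let the JVM do the guarding: a lazy val inside an object is initialized at most once per executor JVM, so every task reuses the already-initialized state without an explicit "was init called?" check. A minimal sketch (the setup body and input path are placeholders):

```
object ExecutorSetup {
  lazy val initialized: Boolean = {
    // One-time, per-JVM setup goes here (for example, the init call mentioned
    // above). This block runs the first time any task on this executor touches it.
    println(s"initializing executor JVM ${java.lang.management.ManagementFactory.getRuntimeMXBean.getName}")
    true
  }
}

val lineLengths = sc.textFile("s3n://bucket/input/*")   // placeholder path
  .mapPartitions { iter =>
    ExecutorSetup.initialized   // forces the one-time setup on the executor
    iter.map(_.length)
  }
```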

confused about memory usage in spark

2014-10-22 Thread Darin McBeath
156MB in S3).  I get that there could be some differences between the serialized storage format and what is then used in memory, but I'm curious as to whether I'm missing something and/or should be doing things differently. Thanks. Darin.

Disabling log4j in Spark-Shell on ec2 stopped working on Wednesday (Oct 8)

2014-10-10 Thread Darin McBeath
ed by the Spark provided ec2 startup script. But, that is purely a guess on my part. I'm wondering if anyone else has noticed this issue and if so has a workaround. Thanks. Darin.

How to pass env variables from master to executors within spark-shell

2014-08-20 Thread Darin McBeath
ESS_KEY")); But, because my SparkContext already exists within spark-shell, this really isn't an option (unless I'm missing something).   Thanks. Darin.
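
For comparison, the route that does work when the SparkConf is built by your own code (rather than by spark-shell) is setExecutorEnv, which maps to the spark.executorEnv.* setting; a sketch, with the variable name purely as an example:

```
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("executor-env-example")
  .setExecutorEnv("AWS_ACCESS_KEY_ID", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
val sc = new SparkContext(conf)

// On the executors the value is then visible as an ordinary environment variable.
val seen = sc.parallelize(1 to 2).map(_ => sys.env.get("AWS_ACCESS_KEY_ID")).collect()
```

The same setting can also go in spark-defaults.conf (spark.executorEnv.AWS_ACCESS_KEY_ID=...), which a pre-built shell context should pick up at launch.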

Issues with S3 client library and Apache Spark

2014-08-15 Thread Darin McBeath
karound to this problem. Thanks. Darin. java.lang.NoSuchMethodError: org.apache.http.impl.conn.DefaultClientConnectionOperator.<init>(Lorg/apache/http/conn/scheme/SchemeRegistry;Lorg/apache/http/conn/DnsResolver;)V at org.apache.http.impl.conn.PoolingClientConnectionManager.createCon

Should the memory of worker nodes be constrained to the size of the master node?

2014-08-14 Thread Darin McBeath
. Darin.

Is there any interest in handling XML within Spark ?

2014-08-13 Thread Darin McBeath
ark-shell (and copy the jar file to executors) root@ip-10-233-73-204 spark]$ ./bin/spark-shell --jars lib/uber-SparkUtils-0.1.jar ## Bring in the sequence file (2 million records) scala> val xmlKeyPair = sc.sequenceFile[String,String]("s3n://darin/xml/part*").cache() ## Test values aga

Re: Number of partitions and Number of concurrent tasks

2014-07-31 Thread Darin McBeath
00 --spot-price=.08 -z us-east-1e --worker-instances=2 my-cluster From: Daniel Siegmann To: Darin McBeath Cc: Daniel Siegmann ; "user@spark.apache.org" Sent: Thursday, July 31, 2014 10:04 AM Subject: Re: Number of partitions and Number of concu

Re: Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
sed on what the documentation states).  What would I want that value to be based on my configuration below?  Or, would I leave that alone? From: Daniel Siegmann To: user@spark.apache.org; Darin McBeath Sent: Wednesday, July 30, 2014 5:58 PM Subject: Re: Numbe

Number of partitions and Number of concurrent tasks

2014-07-30 Thread Darin McBeath
he running application but this had no effect. Perhaps this is ignored for a 'filter' and the default is the total number of cores available. I'm fairly new to Spark, so maybe I'm just missing or misunderstanding something fundamental. Any help would be appreciated. Thanks. Darin.