Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-07 Thread Pralabh Kumar
Hi

Spark 2.0 doesn't support the Hive STORED BY clause. Is there any alternative to
achieve the same?
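
For illustration, a rough sketch of the unsupported DDL next to Spark SQL's own
CREATE TABLE ... USING syntax, which plays the analogous role of binding a table
to a specific data source; the storage handler class, format and path below are
placeholders, and the right USING connector depends on the storage the handler
pointed at:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("stored-by-alternative").getOrCreate()

  // Fails in Spark 2.x: the Hive STORED BY clause is not supported.
  // spark.sql("""CREATE TABLE hbase_t (key STRING, value STRING)
  //              STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'""")

  // Typical alternative: bind the table to a data source via USING,
  // with whatever connector matches the underlying storage.
  spark.sql("""CREATE TABLE t (key STRING, value STRING)
               USING parquet
               OPTIONS (path '/tmp/t')""")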


[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All,

DataWorks Summit, San Jose, 2018 is a good place to share your experience with
advanced analytics, data science, machine learning, and deep learning.
We have an Artificial Intelligence and Data Science session covering technologies
such as:
Apache Spark, Scikit-learn, TensorFlow, Keras, Apache MXNet, PyTorch/Torch,
XGBoost, Apache Livy, Apache Zeppelin, Jupyter, etc.
Please consider submitting an abstract at
https://dataworkssummit.com/san-jose-2018/



Thanks
Yanbo



Issue with EFS checkpoint

2018-02-07 Thread Khan, Obaidur Rehman
Hello,

We have a Spark cluster with 3 worker nodes running as EC2 instances on AWS. The
Spark application runs in cluster mode and the checkpoints are stored in EFS. The
Spark version used is 2.2.0.

We noticed the error below. Our understanding was that this intermittent
checkpoint issue would be resolved by EFS once we moved away from S3.

Caused by: java.io.FileNotFoundException: File 
file:/efs/checkpoint/UPDATE_XXX/offsets/.3ff13bc6-3eeb-4b87-be87-5d1106efcd62.tmp
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
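
For reference, a minimal sketch of how such a checkpoint location is wired up;
this assumes a Structured Streaming query (suggested by the offsets/ directory in
the path), and the source, sink and paths below are simplified placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("UPDATE_XXX").getOrCreate()

  // Placeholder for the real source and transformations.
  val input = spark.readStream.format("rate").load()

  // Checkpoints go to the EFS mount shared by all worker nodes.
  val query = input.writeStream
    .format("parquet")                                       // placeholder sink
    .option("path", "file:///efs/output/UPDATE_XXX")         // placeholder output path
    .option("checkpointLocation", "file:///efs/checkpoint/UPDATE_XXX")
    .start()

  query.awaitTermination()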

Please help me understand the issue and let me know if there is any fix 
available for this.

Regards,
Rehman




Re: Sharing spark executor pool across multiple long running spark applications

2018-02-07 Thread Vadim Semenov
The other way might be to launch a single SparkContext and then run jobs
inside of it.

You can take a look at these projects:
-
https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
- http://livy.incubator.apache.org

Problems with this approach:
- You can't update the code of your jobs.
- A single job can break the SparkContext (see the sketch below).
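
A minimal illustration of the single-SparkContext approach, as a rough sketch not
tied to either project above: one long-lived context with the FAIR scheduler, and
each logical job submitted from its own thread.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  import org.apache.spark.sql.SparkSession

  object SharedContextSketch {
    def main(args: Array[String]): Unit = {
      // One long-lived session; FAIR scheduling lets concurrent jobs share the pool.
      val spark = SparkSession.builder()
        .appName("shared-context-sketch")
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
      val sc = spark.sparkContext

      // Each "job" runs in its own thread against the same SparkContext; the pool
      // names are placeholders (per-pool weights would need a fairscheduler.xml).
      val jobA = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolA")
        spark.range(0L, 1000000L).count()
      }
      val jobB = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolB")
        spark.range(0L, 1000000L).filter("id % 2 = 0").count()
      }

      Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)
      spark.stop()
    }
  }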


We evaluated this approach and decided to go with dynamic allocation instead,
but we also had to rethink the way we write our jobs (a configuration sketch
follows the list):
- We can't use caching, since it locks executors; we use checkpointing instead,
which adds to the computation time.
- We use some unconventional methods, like reusing the same DataFrame to write
out multiple separate outputs in one go.
- We sometimes release executors from within a job, for example once we know how
many we actually need, so those executors can join other jobs.
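
For reference, a minimal dynamic-allocation sketch; the property names are
standard Spark configuration keys, but the values are illustrative and
killExecutors is a developer API, so treat this as an outline rather than a
drop-in setup:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("dynamic-allocation-sketch")
    // Dynamic allocation needs the external shuffle service enabled on the workers.
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")          // illustrative values
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()

  // Releasing specific executors from inside a job (a developer API);
  // the executor IDs here are placeholders.
  spark.sparkContext.killExecutors(Seq("1", "2"))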

On Tue, Feb 6, 2018 at 3:00 PM, Nirav Patel  wrote:

> Currently a SparkContext and its executor pool are not shareable. Each
> SparkContext gets its own executor pool for the entire life of an application.
> So what is the best way to share cluster resources across multiple
> long-running Spark applications?
>
> The only one I see is Spark dynamic allocation, but it has high latency when
> it comes to real-time applications.


How to preserve the order of parquet files?

2018-02-07 Thread Kevin Jung
Hi all,
In Spark 2.2.1, when I load Parquet files, the result is ordered differently
from the original dataset.
It seems the FileSourceScanExec.createNonBucketedReadRDD method sorts the
Parquet file splits by their lengths:
-
val splitFiles = selectedPartitions.flatMap { partition =>
  partition.files.flatMap { file =>
    val blockLocations = getBlockLocations(file)
    if (fsRelation.fileFormat.isSplitable(
        fsRelation.sparkSession, fsRelation.options, file.getPath)) {
      (0L until file.getLen by maxSplitBytes).map { offset =>
        val remaining = file.getLen - offset
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
        val hosts = getBlockHosts(blockLocations, offset, size)
        PartitionedFile(
          partition.values, file.getPath.toUri.toString, offset, size, hosts)
      }
    } else {
      val hosts = getBlockHosts(blockLocations, 0, file.getLen)
      Seq(PartitionedFile(
        partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
    }
  }
}.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)  // <- splits sorted by length, descending

So the partitions representing the part-x.parquet files are always reordered
when I load them.
How can I preserve the order of the original data?
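
For what it is worth, one workaround sketch, assuming an explicit ordering column
can be added at write time instead of relying on file order (paths and names are
placeholders):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.monotonically_increasing_id

  val spark = SparkSession.builder().appName("preserve-order").getOrCreate()

  // At write time: persist an explicit row index. monotonically_increasing_id
  // encodes (partition id, position within partition), so sorting on it
  // reproduces the DataFrame order as of write time.
  val df = spark.range(0L, 1000L).toDF("value")
    .withColumn("row_idx", monotonically_increasing_id())
  df.write.mode("overwrite").parquet("/tmp/ordered_parquet")

  // At read time: restore the intended order by sorting on the persisted column.
  val restored = spark.read.parquet("/tmp/ordered_parquet").orderBy("row_idx")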






Spark CEP with files and no streams?

2018-02-07 Thread Esa Heikkinen
Hello

I am trying to use CEP in Spark on log files (as a batch job), not on streams
(in real time).
Is that possible? If yes, do you know of any example Scala code for that?

Or should I convert the log files (with timestamps) into streams?
And how should the timestamps then be handled in Spark?

If I cannot use Spark at all for this purpose, do you have any recommendations
for other tools?

I would like CEP-style analysis of log files.
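
For illustration, a minimal sketch of reading timestamped log lines as a batch
DataFrame and grouping them into event-time windows; the file path, log format
and window size are assumptions, and this is plain windowed aggregation rather
than a full CEP engine:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, split, to_timestamp, window}

  val spark = SparkSession.builder().appName("log-batch-analysis").getOrCreate()
  import spark.implicits._

  // Hypothetical log format: "2018-02-07 12:00:01,eventA,host1"
  val events = spark.read.text("/data/logs/*.log")          // assumed path
    .select(split(col("value"), ",").as("parts"))
    .select(
      to_timestamp($"parts"(0), "yyyy-MM-dd HH:mm:ss").as("ts"),
      $"parts"(1).as("event"),
      $"parts"(2).as("host"))

  // Event-time windowing also works on batch DataFrames; here: events per 5-minute window.
  val counts = events.groupBy(window($"ts", "5 minutes"), $"event").count()
  counts.show(truncate = false)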