Are there any alternatives to Hive "stored by" clause as Spark 2.0 does not support it

2018-02-07 Thread Pralabh Kumar
Hi

Spark 2.0 doesn't support the Hive STORED BY clause. Is there any alternative to
achieve the same?
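
For illustration, a rough sketch of the unsupported DDL next to Spark SQL's own
CREATE TABLE ... USING syntax, which plays the analogous role of binding a table
to a specific data source; the storage handler class, format and path below are
placeholders, and the right USING connector depends on the storage the handler
pointed at:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("stored-by-alternative").getOrCreate()

  // Fails in Spark 2.x: the Hive STORED BY clause is not supported.
  // spark.sql("""CREATE TABLE hbase_t (key STRING, value STRING)
  //              STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'""")

  // Typical alternative: bind the table to a data source via USING,
  // with whatever connector matches the underlying storage.
  spark.sql("""CREATE TABLE t (key STRING, value STRING)
               USING parquet
               OPTIONS (path '/tmp/t')""")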


[CFP] DataWorks Summit, San Jose, 2018

2018-02-07 Thread Yanbo Liang
Hi All,

DataWorks Summit, San Jose, 2018 is a good place to share your experience with
advanced analytics, data science, machine learning, and deep learning.
We have an Artificial Intelligence and Data Science session covering technologies
such as:
Apache Spark, Scikit-learn, TensorFlow, Keras, Apache MXNet, PyTorch/Torch,
XGBoost, Apache Livy, Apache Zeppelin, Jupyter, etc.
Please consider submitting an abstract at
https://dataworkssummit.com/san-jose-2018/



Thanks
Yanbo



Issue with EFS checkpoint

2018-02-07 Thread Khan, Obaidur Rehman
Hello,

We have a Spark cluster with 3 worker nodes running as EC2 instances on AWS. The
Spark application runs in cluster mode and the checkpoints are stored in EFS. The
Spark version used is 2.2.0.

We noticed the error below. Our understanding was that this intermittent
checkpoint issue would be resolved by EFS once we moved away from S3.

Caused by: java.io.FileNotFoundException: File 
file:/efs/checkpoint/UPDATE_XXX/offsets/.3ff13bc6-3eeb-4b87-be87-5d1106efcd62.tmp
 does not exist
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
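
For reference, a minimal sketch of how such a checkpoint location is wired up;
this assumes a Structured Streaming query (suggested by the offsets/ directory in
the path), and the source, sink and paths below are simplified placeholders:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("UPDATE_XXX").getOrCreate()

  // Placeholder for the real source and transformations.
  val input = spark.readStream.format("rate").load()

  // Checkpoints go to the EFS mount shared by all worker nodes.
  val query = input.writeStream
    .format("parquet")                                       // placeholder sink
    .option("path", "file:///efs/output/UPDATE_XXX")         // placeholder output path
    .option("checkpointLocation", "file:///efs/checkpoint/UPDATE_XXX")
    .start()

  query.awaitTermination()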

Please help me understand the issue and let me know if there is any fix 
available for this.

Regards,
Rehman




Re: Sharing spark executor pool across multiple long running spark applications

2018-02-07 Thread Vadim Semenov
The other way might be to launch a single SparkContext and then run jobs
inside of it.

You can take a look at these projects:
-
https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
- http://livy.incubator.apache.org

Problems with this approach:
- You can't update the code of your jobs.
- A single job can break the SparkContext (see the sketch below).
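
A minimal illustration of the single-SparkContext approach, as a rough sketch not
tied to either project above: one long-lived context with the FAIR scheduler, and
each logical job submitted from its own thread.

  import scala.concurrent.{Await, Future}
  import scala.concurrent.duration.Duration
  import scala.concurrent.ExecutionContext.Implicits.global

  import org.apache.spark.sql.SparkSession

  object SharedContextSketch {
    def main(args: Array[String]): Unit = {
      // One long-lived session; FAIR scheduling lets concurrent jobs share the pool.
      val spark = SparkSession.builder()
        .appName("shared-context-sketch")
        .config("spark.scheduler.mode", "FAIR")
        .getOrCreate()
      val sc = spark.sparkContext

      // Each "job" runs in its own thread against the same SparkContext; the pool
      // names are placeholders (per-pool weights would need a fairscheduler.xml).
      val jobA = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolA")
        spark.range(0L, 1000000L).count()
      }
      val jobB = Future {
        sc.setLocalProperty("spark.scheduler.pool", "poolB")
        spark.range(0L, 1000000L).filter("id % 2 = 0").count()
      }

      Await.result(Future.sequence(Seq(jobA, jobB)), Duration.Inf)
      spark.stop()
    }
  }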


We evaluated this approach and decided to go with dynamic allocation instead,
but we also had to rethink the way we write our jobs (a configuration sketch
follows the list):
- We can't use caching, since it locks executors; we use checkpointing instead,
which adds to the computation time.
- We use some unconventional methods, like reusing the same DataFrame to write
out multiple separate outputs in one go.
- We sometimes release executors from within a job, for example once we know how
many we actually need, so those executors can join other jobs.
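
For reference, a minimal dynamic-allocation sketch; the property names are
standard Spark configuration keys, but the values are illustrative and
killExecutors is a developer API, so treat this as an outline rather than a
drop-in setup:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("dynamic-allocation-sketch")
    // Dynamic allocation needs the external shuffle service enabled on the workers.
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")          // illustrative values
    .config("spark.dynamicAllocation.maxExecutors", "50")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()

  // Releasing specific executors from inside a job (a developer API);
  // the executor IDs here are placeholders.
  spark.sparkContext.killExecutors(Seq("1", "2"))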

On Tue, Feb 6, 2018 at 3:00 PM, Nirav Patel  wrote:

> Currently a SparkContext and its executor pool are not shareable. Each
> SparkContext gets its own executor pool for the entire life of an application.
> So what is the best way to share cluster resources across multiple
> long-running Spark applications?
>
> The only one I see is Spark dynamic allocation, but it has high latency when
> it comes to real-time applications.


How to preserve the order of parquet files?

2018-02-07 Thread Kevin Jung
Hi all,
In Spark 2.2.1, when I load Parquet files, the result is ordered differently
from the original dataset.
It seems the FileSourceScanExec.createNonBucketedReadRDD method sorts the
Parquet file splits by their lengths:
-
val splitFiles = selectedPartitions.flatMap { partition =>
  partition.files.flatMap { file =>
    val blockLocations = getBlockLocations(file)
    if (fsRelation.fileFormat.isSplitable(
        fsRelation.sparkSession, fsRelation.options, file.getPath)) {
      (0L until file.getLen by maxSplitBytes).map { offset =>
        val remaining = file.getLen - offset
        val size = if (remaining > maxSplitBytes) maxSplitBytes else remaining
        val hosts = getBlockHosts(blockLocations, offset, size)
        PartitionedFile(
          partition.values, file.getPath.toUri.toString, offset, size, hosts)
      }
    } else {
      val hosts = getBlockHosts(blockLocations, 0, file.getLen)
      Seq(PartitionedFile(
        partition.values, file.getPath.toUri.toString, 0, file.getLen, hosts))
    }
  }
}.toArray.sortBy(_.length)(implicitly[Ordering[Long]].reverse)  // <- splits sorted by length, descending

So the partitions representing the part-x.parquet files are always reordered
when I load them.
How can I preserve the order of the original data?
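
For what it is worth, one workaround sketch, assuming an explicit ordering column
can be added at write time instead of relying on file order (paths and names are
placeholders):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.monotonically_increasing_id

  val spark = SparkSession.builder().appName("preserve-order").getOrCreate()

  // At write time: persist an explicit row index. monotonically_increasing_id
  // encodes (partition id, position within partition), so sorting on it
  // reproduces the DataFrame order as of write time.
  val df = spark.range(0L, 1000L).toDF("value")
    .withColumn("row_idx", monotonically_increasing_id())
  df.write.mode("overwrite").parquet("/tmp/ordered_parquet")

  // At read time: restore the intended order by sorting on the persisted column.
  val restored = spark.read.parquet("/tmp/ordered_parquet").orderBy("row_idx")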






Spark CEP with files and no streams?

2018-02-07 Thread Esa Heikkinen
Hello

I am trying to use CEP in Spark on log files (as a batch job), not on streams
(in real time).
Is that possible? If yes, do you know of any example Scala code for that?

Or should I convert the log files (with timestamps) into streams?
And how should the timestamps then be handled in Spark?

If I cannot use Spark at all for this purpose, do you have any recommendations
for other tools?

I would like CEP-style analysis of log files.
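
For illustration, a minimal sketch of reading timestamped log lines as a batch
DataFrame and grouping them into event-time windows; the file path, log format
and window size are assumptions, and this is plain windowed aggregation rather
than a full CEP engine:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.{col, split, to_timestamp, window}

  val spark = SparkSession.builder().appName("log-batch-analysis").getOrCreate()
  import spark.implicits._

  // Hypothetical log format: "2018-02-07 12:00:01,eventA,host1"
  val events = spark.read.text("/data/logs/*.log")          // assumed path
    .select(split(col("value"), ",").as("parts"))
    .select(
      to_timestamp($"parts"(0), "yyyy-MM-dd HH:mm:ss").as("ts"),
      $"parts"(1).as("event"),
      $"parts"(2).as("host"))

  // Event-time windowing also works on batch DataFrames; here: events per 5-minute window.
  val counts = events.groupBy(window($"ts", "5 minutes"), $"event").count()
  counts.show(truncate = false)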