Bucketing

2018-11-12 Thread Sai
Hi all,

I am trying to use bucketing and realized that it is not allowed on
DataFrame writes. Is there any workaround for this, or any update on when
this functionality will be made available in Spark?

Thanks 
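
For reference, bucketing is currently only honored when writing to a table via
saveAsTable(); combining bucketBy with a path-based save() is rejected. A
minimal sketch of the usual workaround, assuming a DataFrame `df` with a
user_id column (the column and table names here are only illustrative):

// Minimal sketch; `df` and the names below are illustrative.
df.write
  .bucketBy(8, "user_id")
  .sortBy("user_id")
  .saveAsTable("bucketed_users")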



Re: writing to local files on a worker

2018-11-12 Thread Steve Lewis
I have been looking at Spark-BLAST, which calls BLAST (a well-known C++
program) in parallel.
In my case I have tried to translate the C++ code to Java but am not
getting the same results; the code is convoluted.
I have code that will call the program and read its results; the only real
issue is that the program wants local files.
Its file access is convoluted, with many seeks, so replacing the files with
streaming will not work.
As long as my Java code can write to a local file for the duration of one
call, things can work.

I considered in-memory files, as long as they can be passed to another
program, and I am willing to have OS-specific code.
So my issue is that I need to write 3 files, run a program, and read one
output file; then all the files can be deleted.
JNI calls will be hard; this is a program, not a library, and it is
available on the worker nodes.

On Sun, Nov 11, 2018 at 10:52 PM Jörn Franke  wrote:

> Can you use JNI to call the C++ functionality directly from Java?
>
> Or could you wrap this into an MR step outside Spark and use Hadoop Streaming
> (it allows you to use shell scripts as mapper and reducer)?
>
> You can also write temporary files for each partition and execute the
> software within a map step.
>
> Generally you should not call external applications from Spark.
>
> > Am 11.11.2018 um 23:13 schrieb Steve Lewis :
> >
> > I have a problem where a critical step needs to be performed by a third
> party C++ application. I can send or install this program on the worker
> nodes. I can construct a function holding all the data this program needs
> to process. The problem is that the program is designed to read and write
> from the local file system. I can call the program from Java and read its
> output as a local file, then delete all temporary files, but I doubt
> that it is possible to get the program to read from HDFS or any shared file
> system.
> > My question is: can a function running on a worker node create temporary
> files and pass the names of these to a local process, assuming everything is
> cleaned up after the call?
> >


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
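
For what it's worth, a minimal sketch of the temp-file-per-partition approach
suggested above, assuming the external binary is installed at the same local
path on every worker (the tool path, file formats, and input/output locations
below are placeholders):

import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

// Assumes an existing SparkContext `sc`; /opt/tools/mytool is a placeholder path.
val inputRdd = sc.textFile("hdfs:///data/input")

val results = inputRdd.mapPartitions { records =>
  // Worker-local temp files, visible only to processes on this node.
  val inFile  = Files.createTempFile("tool-in-", ".txt")
  val outFile = Files.createTempFile("tool-out-", ".txt")
  try {
    Files.write(inFile, records.mkString("\n").getBytes("UTF-8"))
    // Run the external program synchronously and check its exit code.
    val exit = Seq("/opt/tools/mytool", inFile.toString, outFile.toString).!
    require(exit == 0, s"external tool failed with exit code $exit")
    // Materialize the output before the temp files are deleted below.
    Files.readAllLines(outFile).asScala.toList.iterator
  } finally {
    Files.deleteIfExists(inFile)
    Files.deleteIfExists(outFile)
  }
}
results.saveAsTextFile("hdfs:///data/output")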


question about barrier execution mode in Spark 2.4.0

2018-11-12 Thread Joe

Hello,
I was reading the Spark 2.4.0 release docs and I'd like to find out more
about barrier execution mode.
In particular, I'd like to know what happens when the number of partitions
exceeds the number of nodes (which I think is allowed; the Spark tuning doc
mentions that).
Does Spark guarantee that all tasks process all partitions
simultaneously? If not, then how does barrier mode handle partitions that
are waiting to be processed?
If there are partitions waiting to be processed, then I don't think it's
possible to send all the data from a given stage to a DL process, even when
using barrier mode?

Thanks a lot,

Joe
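
For what it's worth, my understanding is that a barrier stage launches all of
its tasks together, and the job fails if the cluster cannot run them all
concurrently, rather than queuing the extra partitions. A minimal sketch of
the 2.4 API, assuming an existing SparkContext `sc`:

import org.apache.spark.BarrierTaskContext

// Keep the partition count at or below the number of available task slots.
val rdd = sc.parallelize(1 to 100, numSlices = 4)

val doubled = rdd.barrier().mapPartitions { iter =>
  val ctx = BarrierTaskContext.get()
  // Every task blocks here until all tasks in the stage reach this point.
  ctx.barrier()
  iter.map(_ * 2)
}
doubled.collect()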





Re: Questions on Python support with Spark

2018-11-12 Thread Patrick McCarthy
I've never tried to run a standalone cluster alongside Hadoop, but why not
run Spark as a YARN application? That way it can absolutely (in fact,
preferably) use the distributed file system.

On Fri, Nov 9, 2018 at 5:04 PM, Arijit Tarafdar  wrote:

> Hello All,
>
>
>
> We have a requirement to run PySpark in standalone cluster mode and also
> reference Python libraries (egg/wheel) which are not local but placed in a
> distributed storage like HDFS. From the code it looks like neither of these
> cases is supported.
>
>
>
> Questions are:
>
>
>
>    1. Why is PySpark supported only in standalone client mode?
>    2. Why does --py-files only support local files and not files stored in
>    remote stores?
>
>
>
> We would like to update the Spark code to support these scenarios, but we just
> want to be aware of any technical difficulties that the community has faced
> while trying to support them.
>
>
>
> Thanks, Arijit
>
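
For what it's worth, when running on YARN the --py-files argument can, as far
as I know, point at HDFS URIs directly, since YARN localizes remote files onto
the containers; a sketch with placeholder paths:

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files hdfs:///libs/deps.zip \
  hdfs:///apps/my_job.py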


Re: FW: Spark2 and Hive metastore

2018-11-12 Thread Sergey B.
In order for Spark to see the Hive metastore, you need to build the
SparkSession accordingly:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("myApp")
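  // hive.metastore.uris must point at your metastore service's thrift endpoint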
  .config("hive.metastore.uris","thrift://localhost:9083")
  .enableHiveSupport()
  .getOrCreate()

On Mon, Nov 12, 2018 at 11:49 AM Ирина Шершукова 
wrote:

>
>
> Hello guys, Spark 2.1.0 couldn't connect to an existing Hive metastore.
>
>
>
>
>


Re: [Spark-Core] Long scheduling delays (1+ hour)

2018-11-12 Thread bsikander
Forgot to add the link
https://jira.apache.org/jira/browse/KAFKA-5649



