Re: Hive permanent functions are not available in Spark SQL

2015-09-30 Thread Pala M Muthaia
+user list

On Tue, Sep 29, 2015 at 3:43 PM, Pala M Muthaia  wrote:

> Hi,
>
> I am trying to use internal UDFs that we have added as permanent functions
> to Hive from within a Spark SQL query (using HiveContext), but I encounter
> NoSuchObjectException, i.e. the function could not be found.
>
> However, if I execute the 'show functions' command in Spark SQL, the
> permanent functions appear in the list.
>
> I am using Spark 1.4.1 with Hive 0.13.1. I tried to debug this by looking
> at the log and code, and it seems the 'show functions' command and the UDF
> query go through essentially the same code path, yet the former can see the
> UDF while the latter cannot.
>
> Any ideas on how to debug/fix this?
>
>
> Thanks,
> pala
>
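
For reference, a minimal sketch of the reported scenario (the function name
my_udf and the table events are hypothetical, not taken from the thread):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object PermanentUdfRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("permanent-udf-repro"))
        val hiveContext = new HiveContext(sc)

        // The permanent function shows up here, as described in the report
        hiveContext.sql("SHOW FUNCTIONS").collect().foreach(println)

        // ...but invoking it reportedly fails with NoSuchObjectException
        hiveContext.sql("SELECT my_udf(col1) FROM events").show()
      }
    }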


Re: using JavaRDD in spark-redis connector

2015-09-30 Thread Akhil Das
You can create a JavaRDD as normal and then call its .rdd() method to get the
underlying RDD.

Thanks
Best Regards

On Mon, Sep 28, 2015 at 9:01 PM, Rohith P 
wrote:

> Hi all,
>   I am trying to work with spark-redis connector (redislabs) which
> requires all transactions between redis and spark be in RDD's. The language
> I am using is Java but the connector does not accept JavaRDD's .So I tried
> using Spark context in my code instead of JavaSparkContext. But when I
> wanted to create a RDD using sc.parallelize , it asks for some scala
> related
> parameters as opposed to lists in java when I tries to have both
> javaSparkContext and sparkcontext(for connector) then Multiple contexts
> cannot be opened was the error
>  The code that I have been trying 
>
>
> // initialize spark context
> private static RedisContext config() {
>     conf = new SparkConf().setAppName("redis-jedis");
>     sc2 = new SparkContext(conf);
>     RedisContext rc = new RedisContext(sc2);
>     return rc;
> }
>
> // write to redis, which requires the data to be in an RDD
> private static void WriteUserTacticData(RedisContext rc, String userid,
>         String tacticsId, String value) {
>     hostTup = calling(redisHost, redisPort);
>     String key = userid + "-" + tacticsId;
>     RDD<Tuple2<String, String>> newTup = createTuple(key, value);
>     rc.toRedisKV(newTup, hostTup);
> }
>
> // createTuple: builds the RDD that will be inserted into redis
> private static RDD<Tuple2<String, String>> createTuple(String key,
>         String value) {
>     // opening a JavaSparkContext here while a SparkContext already exists
>     // is what triggers the "multiple contexts" error described above
>     sc = new JavaSparkContext(conf);
>     ArrayList<Tuple2<String, String>> list =
>             new ArrayList<Tuple2<String, String>>();
>     Tuple2<String, String> e = new Tuple2<String, String>(key, value);
>     list.add(e);
>     JavaRDD<Tuple2<String, String>> javardd = sc.parallelize(list);
>     // JavaRDD.toRDD(javardd) (or javardd.rdd()) yields the Scala RDD that
>     // the connector expects
>     RDD<Tuple2<String, String>> newTupRdd = JavaRDD.toRDD(javardd);
>     sc.close();
>     return newTupRdd;
> }
>
>
>
> How would I create an RDD (not a JavaRDD) in Java that will be accepted by
> the redis connector? Any help related to the topic would be
> appreciated.
>


CQs on WindowedStream created on running StreamingContext

2015-09-30 Thread Yogs
Hi,

We intend to run ad hoc windowed continuous queries on Spark streaming data.
The queries could be registered/deregistered dynamically or submitted through
a command line. Currently Spark Streaming doesn't allow adding any new
inputs, transformations, or output operations after starting a
StreamingContext. But making the following code changes in DStream.scala
allows me to create a window on a DStream even after the StreamingContext
has started (i.e. in StreamingContextState.ACTIVE).

1) In DStream.validateAtInit()
Allowed adding new inputs, transformations, and output operations after
starting a streaming context.
2) In DStream.persist()
Allowed changing the storage level of a DStream after the streaming context
has started.

Ultimately the window API just does a slice on the parent DStream and returns
allRDDsInWindow. We create DataFrames out of these RDDs from this particular
WindowedDStream, and evaluate queries on those DataFrames, as in the sketch
below.
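
For illustration, a rough sketch of that slice-based evaluation (the Event
case class, the stream, and the query are assumptions, and it presumes the
DStream's RDDs are still remembered/persisted when the window is evaluated):

    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    case class Event(id: Long, value: Double)

    def evaluateWindow(sqlContext: SQLContext,
                       stream: DStream[Event],
                       from: Time, to: Time): Unit = {
      // slice() returns the RDDs generated in [from, to], which is what
      // WindowedDStream ultimately relies on
      val rddsInWindow = stream.slice(from, to)
      if (rddsInWindow.nonEmpty) {
        import sqlContext.implicits._
        val df = rddsInWindow.reduce(_ union _).toDF()
        df.registerTempTable("window_events")
        sqlContext.sql("SELECT count(*) FROM window_events").show()
      }
    }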

1) Do you see any challenges or consequences with this approach?
2) Will these on-the-fly created WindowedDStreams be accounted for properly
in the runtime and memory management?
3) What is the reason we do not allow creating new windows in the
StreamingContextState.ACTIVE state?
4) Does it make sense to add our own implementation of WindowedDStream in
this case?

- Yogesh


Task Execution

2015-09-30 Thread gsvic
Concerning task execution, does a worker execute its assigned tasks in
parallel or sequentially?






GraphX PageRank keeps 3 copies of graph in memory

2015-09-30 Thread Ulanov, Alexander
Dear Spark developers,

I would like to understand GraphX caching behavior with regard to PageRank in
Spark, in particular the following implementation of PageRank:
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala

On each iteration a new graph is created and cached, and the old graph is
uncached:
1) Create the new graph and cache it:
    rankGraph = rankGraph.joinVertices(rankUpdates) {
      (id, oldRank, msgSum) => rPrb(src, id) + (1.0 - resetProb) * msgSum
    }.cache()
2) Unpersist the old one:
    prevRankGraph.vertices.unpersist(false)
    prevRankGraph.edges.unpersist(false)

According to the code, at the end of each iteration only one graph should be
in memory, i.e. one EdgeRDD and one VertexRDD. During the iteration, exactly
between the mentioned lines of code, there will be two graphs, old and new,
that is, two pairs of Edge and Vertex RDDs. However, when I run the example
provided in the Spark examples folder, I observe different behavior.

Run the example (I checked that it runs the mentioned code):
$SPARK_HOME/bin/spark-submit --class 
"org.apache.spark.examples.graphx.SynthBenchmark"  --master 
spark://mynode.net:7077 $SPARK_HOME/examples/target/spark-examples.jar

According to "Storage" and RDD DAG in Spark UI, 3 VertexRDDs and 3 EdgeRDDs are 
cached, even when all iterations are finished, given that the mentioned code 
suggests caching at most 2 (and only in particular stage of the iteration):
https://drive.google.com/file/d/0BzYMzvDiCep5WFpnQjFzNy0zYlU/view?usp=sharing
Edges (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S2JtYnhVTlV1Sms/view?usp=sharing
Vertices (the green ones are cached):
https://drive.google.com/file/d/0BzYMzvDiCep5S1k4N2NFb05RZDA/view?usp=sharing

Could you explain, why 3 VertexRDDs and 3 EdgeRDDs are cached?

Is it OK that there is double caching in the code, given that joinVertices
implicitly caches the vertices and then the graph is cached again in the
PageRank code?

Best regards, Alexander
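
For what it's worth, here is a simplified sketch (plain RDDs, not the actual
PageRank code) of the cache/materialize/unpersist pattern being asked about;
the update step is hypothetical:

    import org.apache.spark.rdd.RDD

    def iterate(initial: RDD[(Long, Double)], numIter: Int)
               (update: RDD[(Long, Double)] => RDD[(Long, Double)]): RDD[(Long, Double)] = {
      var current = initial.cache()
      for (_ <- 1 to numIter) {
        val previous = current
        current = update(previous).cache()
        current.foreachPartition(_ => ())     // materialize the new copy first
        previous.unpersist(blocking = false)  // then drop the old one
      }
      current
    }

With this ordering, at most two copies should be cached at any moment (only
during the hand-over) and a single copy between iterations.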


Re: Speculatively using spare capacity

2015-09-30 Thread Sean Owen
Why change the number of partitions of RDDs, especially since you can't
generally do that without a shuffle? If you just mean to ramp resource usage
up and down, dynamic allocation (of executors) already does that.
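
For reference, a sketch of turning on dynamic allocation (the values are
illustrative, and it assumes the external shuffle service is running on the
cluster):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("dynamic-allocation-example")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")   // required by dynamic allocation
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")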

On Wed, Sep 30, 2015 at 10:49 PM, Muhammed Uluyol  wrote:
> Hello,
>
> How feasible would it be to have Spark speculatively increase the number of
> partitions when there is spare capacity in the system? We want to do this to
> decrease application runtime. Initially, we will assume that function calls
> of the same type will have the same runtime (e.g. all maps take equal time)
> and that the runtime will scale linearly with the number of workers. If a
> numPartitions value is specified, we may increase beyond this, but if a
> Partitioner is specified, we would not change the number of partitions.
>
> Some initial questions we had:
>  * Does spark already do this?
>  * Is there interest in supporting this functionality?
>  * Are there any potential issues that we should be aware of?
>  * What changes would need to be made for such a project?
>
> Thanks,
> Muhammed




Speculatively using spare capacity

2015-09-30 Thread Muhammed Uluyol
Hello,

How feasible would it be to have Spark speculatively increase the number of
partitions when there is spare capacity in the system? We want to do this
to decrease application runtime. Initially, we will assume that function
calls of the same type will have the same runtime (e.g. all maps take equal
time) and that the runtime will scale linearly with the number of workers.
If a numPartitions value is specified, we may increase beyond this, but if a
Partitioner is specified, we would not change the number of partitions.

Some initial questions we had:
 * Does spark already do this?
 * Is there interest in supporting this functionality?
 * Are there any potential issues that we should be aware of?
 * What changes would need to be made for such a project?

Thanks,
Muhammed


Re: failed to run spark sample on windows

2015-09-30 Thread Renyi Xiong
Thanks a lot, it works now after I set %HADOOP_HOME%.
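
For reference, a programmatic alternative that is sometimes used for local
runs (not mentioned in this thread, so treat it as an assumption) is to point
hadoop.home.dir at a directory containing bin\winutils.exe before the
SparkContext is created:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical local path; it must contain bin\winutils.exe
    System.setProperty("hadoop.home.dir", "C:\\hadoop")
    val sc = new SparkContext(
      new SparkConf().setAppName("windows-local").setMaster("local[*]"))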

On Tue, Sep 29, 2015 at 1:22 PM, saurfang  wrote:

> See
>
> http://stackoverflow.com/questions/26516865/is-it-possible-to-run-hadoop-jobs-like-the-wordcount-sample-in-the-local-mode
> ,
> https://issues.apache.org/jira/browse/SPARK-6961 and finally
> https://issues.apache.org/jira/browse/HADOOP-10775. The easy solution is to
> download a Windows Hadoop distribution and point %HADOOP_HOME% to that
> location so winutils.exe can be picked up.


Re: unsubscribe

2015-09-30 Thread Richard Hillegas

Hi Sukesh,

To unsubscribe from the dev list, please send a message to
dev-unsubscr...@spark.apache.org. To unsubscribe from the user list, please
send a message to user-unsubscr...@spark.apache.org. Please see:
http://spark.apache.org/community.html#mailing-lists.

Thanks,
-Rick

sukesh kumar  wrote on 09/28/2015 11:39:01 PM:

> From: sukesh kumar 
> To: "u...@spark.apache.org" ,
> "dev@spark.apache.org" 
> Date: 09/28/2015 11:39 PM
> Subject: unsubscribe
>
> unsubscribe
>
> --
> Thanks & Best Regards
> Sukesh Kumar