RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread Udbhav Agarwal
That sounds great. Thanks.
Can I assume that the source for a stream in Spark can only be some external source
like Kafka etc.? Can the source not be some RDD in Spark or some external file?

Thanks,
Udbhav
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Friday, September 16, 2016 3:01 AM
To: Udbhav Agarwal 
Cc: user 
Subject: RE: Spark processing Multiple Streams from a single stream


You may consider writing back to Kafka from main stream and then have 
downstream consumers.
This will keep things modular and independent.
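
For reference, writing a processed DStream back to Kafka is usually done inside
foreachRDD/foreachPartition. A minimal sketch, assuming a DStream[String] called
"processed" and an output topic "processed-events" (both illustrative, not from
this thread):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

processed.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // one producer per partition per batch; a pooled/broadcast producer is a common refinement
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")  // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    partition.foreach(msg => producer.send(new ProducerRecord[String, String]("processed-events", msg)))
    producer.close()
  }
}

The downstream consumers (the 4 parallel tasks) can then each subscribe to that
topic independently.
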
On 15 Sep 2016 23:29, "Udbhav Agarwal" wrote:
Thank you Ayan for a reply.
Source is kafka but I am reading from this source in my main stream. I will 
perform some operations here. Then I want to send the output of these operations 
to 4 parallel tasks. For these 4 parallel tasks I want 4 new streams. Is such 
an implementation possible here?

Thanks,
Udbhav
From: ayan guha [mailto:guha.a...@gmail.com]
Sent: Thursday, September 15, 2016 6:43 PM
To: Udbhav Agarwal 
Cc: user
Subject: Re: Spark processing Multiple Streams from a single stream


Depending on source. For example, if source is Kafka then you can write 4 
streaming consumers.
On 15 Sep 2016 20:11, "Udbhav Agarwal" wrote:
Hi All,
I have a scenario where I want to process a message in various ways in 
parallel. For instance, a message is coming inside a Spark stream (DStream) and I 
want to send this message to 4 different tasks in parallel. I want these 4 
different tasks to be separate streams derived from the original Spark stream that 
are always active and waiting for input. Can I implement such a process with Spark 
Streaming? How?
Thanks in advance.

Thanks,
Udbhav Agarwal




Re: Missing output partition file in S3

2016-09-16 Thread Steve Loughran

On 15 Sep 2016, at 19:37, Chen, Kevin wrote:

Hi,

Has anyone encountered an issue of missing output partition files in S3? My 
spark job writes output to an S3 location. Occasionally, I noticed one partition 
file is missing. As a result, one chunk of data was lost. If I rerun the same 
job, the problem usually goes away. This has been happening pretty randomly; I 
observed it once or twice a week on a daily job. I am using Spark 1.2.1.

Very much appreciated on any input, suggestion of fix/workaround.




This doesn't sound good.

Without making any promises about being able to fix this, I would like to 
understand the setup to see if there is something that could be done to address 
this.

  1.  Which S3 installation? US East or elsewhere
  2.  Which s3 client: s3n or s3a? If on Hadoop 2.7+, can you switch to s3a if 
you haven't already (exception: if you are using AWS EMR you have to stick with 
their s3:// client)?
  3.  Are you running in-EC2 or remotely?
  4.  How big are the datasets being generated?
  5.  Do you have speculative execution turned on
  6.  Which committer? Is it the external "DirectCommitter", or the classic Hadoop 
FileOutputCommitter? If the latter, and you are using Hadoop 2.7.x, can you try the v2 
algorithm (hadoop.mapreduce.fileoutputcommitter.algorithm.version 2)? A sketch follows below.
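
For point 6, a sketch of switching to the v2 commit algorithm (assuming Hadoop
2.7.x on the classpath; the underlying Hadoop property is
mapreduce.fileoutputcommitter.algorithm.version, settable on the SparkContext's
Hadoop configuration or as spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
at submit time):

// set before any output is written
sc.hadoopConfiguration.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2)
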

I should warn that the stance of myself and colleagues is "don't commit direct 
to S3": write to HDFS and do a distcp when you finally copy out the data. S3 
itself doesn't have enough consistency for committing output to work in the 
presence of all race conditions and failure modes. At least here you've noticed 
the problem; the thing people fear is not noticing that a problem has arisen.

-Steve


Re: Spark Streaming-- for each new file in HDFS

2016-09-16 Thread Steve Loughran

On 16 Sep 2016, at 01:03, Peyman Mohajerian wrote:

You can listen to files in a specific directory using
streamingContext.fileStream. Take a look at:
http://spark.apache.org/docs/latest/streaming-programming-guide.html



yes, this works

here's an example I'm using to test using object stores like s3 & azure as 
sources of data

https://github.com/steveloughran/spark/blob/c2b7d885f91bb447ace8fbac427b2fdf9c84b4ef/cloud/src/main/scala/org/apache/spark/cloud/examples/CloudStreaming.scala#L83

StreamingContext.textFileStream(streamGlobPath.toUri.toString) takes a 
directory ("/incoming/") or a glob path to directories ("incoming/2016/09/*") 
and will scan for data.


- It will scan every window, looking for files with a modified time within that 
window.
- You can then just hook up a map to the output, start the ssc, and evaluate the 
results:


  val lines = ssc.textFileStream(streamGlobPath.toUri.toString)

  val matches = lines.filter(_.endsWith("3")).map(line => {
    sightings.add(1)
    line
  })

  matches.print()
  ssc.start()

Once a file has been processed, it will not be scanned again, even if its 
modtime is updated (ignoring executor failure/restart, and the bits in the 
code about remember durations). That means updates to a file within a window 
can be missed.

If you are writing to files from separate code, it is safest to write elsewhere 
and then copy/rename the file once complete.
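
A minimal sketch of that pattern against HDFS (paths are illustrative; the key
point is that the file only becomes visible under its final name once it is
complete):

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(sc.hadoopConfiguration)
val tmp = new Path("/incoming/.part-0001.tmp")   // dot-prefixed names are treated as hidden and skipped
val dst = new Path("/incoming/part-0001")

val out = fs.create(tmp)
out.write("some,record,data\n".getBytes("UTF-8"))
out.close()                    // finish writing before exposing the file
fs.rename(tmp, dst)            // atomic on HDFS; the stream sees it in a later window
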


(Things are slightly complicated by the fact that HDFS doesn't update modtimes 
until (a) the file is closed or (b) enough data has been written that the write 
spans a block boundary. That means that incremental writes to HDFS may appear 
to work, but once you write > 64 MB, or work with a different FS, changes may 
get lost.)

But: it does work, and lets you glue streaming code up to any workflow which 
generates output in files.



On Thu, Sep 15, 2016 at 10:31 AM, Jörn Franke wrote:
Hi,
I recommend that the third-party application puts an empty file with the same 
filename as the original file, but with the extension ".uploaded". This is an 
indicator that the file has been fully (!) written to the fs. Otherwise you 
risk only reading parts of the file.
Then, you can have a file system listener for this ".uploaded" file.

Spark streaming or Kafka are not needed/suitable if the server is a file 
server. You can use Oozie (maybe with a simple custom action) to poll for 
.uploaded files and transmit them.





Re: Impersonate users using the same SparkContext

2016-09-16 Thread Steve Loughran

> On 16 Sep 2016, at 04:43, gsvigruha wrote:
> 
> Hi,
> 
> is there a way to impersonate multiple users using the same SparkContext
> (e.g. like this
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/Superusers.html)
> when going through the Spark API?
> 
> What I'd like to do is that
> 1) submit a long running Spark yarn-client application using a Hadoop
> superuser (e.g. "super")
> 2) impersonate different users with "super" when reading/writing restricted
> HDFS files using the Spark API
> 
> I know about the --proxy-user flag but its effect is fixed within a
> spark-submit.
> 
> I looked at the code and it seems the username is determined by the
> SPARK_USER env var first (which seems to be always set) and then the
> UserGroupInformation.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/Utils.scala#L2247
> What I'd like I guess is the UserGroupInformation to take priority.
> 

If you can get the Kerberos tickets or Hadoop tokens all the way to your code, 
then you can execute the code in a doAs call; this adopts the Kerberos tokens of 
that context to access HDFS, Hive, HBase, etc.

otherUserUGI.doAs {
  // code here runs with the other user's identity
}

If you just want to run something as a different user:

- short-lived: have oozie set things up
- long-lived: you need the kerberos keytab of whoever the app needs to run as.


On an insecure cluster, the identity used to talk to HDFS can actually be set 
in the env var HADOOP_USER_NAME; you can also use some of the UGI methods like 
createProxyUser() to create the identity to spoof:

val hbase = UserGroupInformation.createRemoteUser("hbase")
hbase.doAs() { ... }
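
A slightly fuller sketch of the proxy-user variant (assuming the superuser has
the usual hadoop.proxyuser.* settings from the Superusers doc linked above;
"alice" and the path are illustrative, and sc is the usual SparkContext):

import java.security.PrivilegedExceptionAction
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

val superUgi = UserGroupInformation.getCurrentUser
val aliceUgi = UserGroupInformation.createProxyUser("alice", superUgi)

val listing = aliceUgi.doAs(new PrivilegedExceptionAction[Array[FileStatus]] {
  override def run(): Array[FileStatus] = {
    val fs = FileSystem.get(sc.hadoopConfiguration)   // filesystem opened as "alice"
    fs.listStatus(new Path("/user/alice"))
  }
})
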


some possibly useful information

https://www.youtube.com/watch?v=Xz2tPmK2cKg
https://steveloughran.gitbooks.io/kerberos_and_hadoop/content/


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Missing output partition file in S3

2016-09-16 Thread Igor Berman
are you using speculation?

On 15 September 2016 at 21:37, Chen, Kevin  wrote:

> Hi,
>
> Has any one encountered an issue of missing output partition file in S3 ?
> My spark job writes output to a S3 location. Occasionally, I noticed one
> partition file is missing. As a result, one chunk of data was lost. If I
> rerun the same job, the problem usually goes away. This has been happening
> pretty random. I observed once or twice a week on a daily run job. I am
> using Spark 1.2.1.
>
> Very much appreciated on any input, suggestion of fix/workaround.
>
>
>
>


RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
In fact you can use an RDD as well, using queueStream, but it is considered for
testing only, as per the documents. A minimal sketch follows below.
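
A minimal sketch of queueStream (batch interval and data are illustrative):

import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val queue = new mutable.Queue[RDD[String]]()
val stream = ssc.queueStream(queue)

stream.map(_.toUpperCase).print()
ssc.start()

// push existing RDDs into the stream; each one becomes a batch
queue += sc.parallelize(Seq("a", "b", "c"))
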
On 16 Sep 2016 17:44, "ayan guha"  wrote:

> Rdd no. File yes, using fileStream. But filestream does not support
> replay, I think. You need to manage checkpoint yourself.
> On 16 Sep 2016 16:56, "Udbhav Agarwal"  wrote:
>
>> That sounds great. Thanks.
>>
>> Can I assume that source for a stream in spark can only be some external
>> source like kafka etc.? Source cannot be some rdd in spark or some external
>> file ?
>>
>>
>>
>> Thanks,
>>
>> Udbhav
>>
>> *From:* ayan guha [mailto:guha.a...@gmail.com]
>> *Sent:* Friday, September 16, 2016 3:01 AM
>> *To:* Udbhav Agarwal 
>> *Cc:* user 
>> *Subject:* RE: Spark processing Multiple Streams from a single stream
>>
>>
>>
>> You may consider writing back to Kafka from main stream and then have
>> downstream consumers.
>> This will keep things modular and independent.
>>
>> On 15 Sep 2016 23:29, "Udbhav Agarwal" 
>> wrote:
>>
>> Thank you Ayan for a reply.
>>
>> Source is kafka but I am reading from this source in my main stream. I
>> will perform some operations here. Then I want to send the output of these
>> operation to 4 parallel tasks. For these 4 parallel tasks I want 4 new
>> streams. Is such an implementation possible here ?
>>
>>
>>
>> Thanks,
>>
>> Udbhav
>>
>> *From:* ayan guha [mailto:guha.a...@gmail.com]
>> *Sent:* Thursday, September 15, 2016 6:43 PM
>> *To:* Udbhav Agarwal 
>> *Cc:* user 
>> *Subject:* Re: Spark processing Multiple Streams from a single stream
>>
>>
>>
>> Depending on source. For example, if source is Kafka then you can write 4
>> streaming consumers.
>>
>> On 15 Sep 2016 20:11, "Udbhav Agarwal" 
>> wrote:
>>
>> Hi All,
>>
>> I have a scenario where I want to process a message in various ways in
>> parallel. For instance a message is coming inside spark stream(DStream) and
>> I want to send this message to 4 different tasks in parallel. I want these
>> 4 different tasks to be separate streams in the original spark stream and
>> are always active and waiting for input. Can I implement such a process
>> with spark streaming ? How ?
>>
>> Thanks in advance.
>>
>>
>>
>> *Thanks,*
>>
>> *Udbhav Agarwal*
>>
>>
>>
>>
>>
>>


Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
You may have to decrease the checkpoint interval to say 5 if you're
getting StackOverflowError. You may have a particularly deep lineage
being created during iterations.
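
A sketch of what that looks like with the MLlib ALS API (the checkpoint
directory is an illustrative assumption; setCheckpointInterval only takes
effect once a checkpoint dir is set):

import org.apache.spark.mllib.recommendation.ALS

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")

val model = new ALS()
  .setRank(32)
  .setIterations(50)
  .setCheckpointInterval(5)   // checkpoint every 5 iterations to cut lineage depth
  .run(ratings)               // ratings: RDD[Rating], prepared elsewhere
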

No space left on device means you don't have enough local disk to
accommodate the big shuffles in some stage. You can add more disk or
maybe look at tuning shuffle params to do more in memory and maybe
avoid spilling to disk as much.

However, given the small data size, I'm surprised that you see either problem.

10-20 iterations is usually where the model stops improving much anyway.

I can run 100 iterations of rank 30 on my *laptop* so something is
fairly wrong in your setup or maybe in other parts of your user code.

On Thu, Sep 15, 2016 at 10:00 PM, Roshani Nagmote
 wrote:
> Hi,
>
> I need help to run matrix factorization ALS algorithm in Spark MLlib.
>
> I am using dataset(1.5Gb) having 480189 users and 17770 items formatted in
> similar way as Movielens dataset.
> I am trying to run MovieLensALS example jar on this dataset on AWS Spark EMR
> cluster having 14 M4.2xlarge slaves.
>
> Command run:
> /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class
> org.apache.spark.examples.mllib.MovieLensALS --jars
> /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
> /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32
> --numIterations 50 --kryo s3://dataset/input_dataset
>
> Issues I get:
> If I increase rank to 70 or more and numIterations 15 or more, I get
> following errors:
> 1) stack overflow error
> 2) No space left on device - shuffle phase
>
> Could you please let me know if there are any parameters I should tune to
> make this algorithm work on this dataset?
>
> For better rmse, I want to increase iterations. Am I missing something very
> trivial? Could anyone help me run this algorithm on this specific dataset
> with more iterations?
>
> Was anyone able to run ALS on spark with more than 100 iterations and rank
> more than 30?
>
> Any help will be greatly appreciated.
>
> Thanks and Regards,
> Roshani

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Mich Talebzadeh
Hi Sean,

At the moment I am using Zeppelin with Spark SQL to get data from Hive. So
any connection here for visualisation has to be through this sort of API.

I know Tableau only uses SQL. Zeppelin can use Spark SQL directly or
through the Spark Thrift Server.

The question is that a user may want to create a join or something involving
many tables, and the preference would be to use some sort of database.

In this case Hive is running on Spark engine so we are not talking about
Map-reduce and the associated latency.

That Hive element can be easily plugged out. So our requirement is to
present multiple tables to dashboard and let the user slice and dice.

The factors are not just speed but also the functionality. At the moment
Zeppelin uses Spark SQL. I can get rid of Hive and replace it with another
but I think I still need to have a tabular interface to Flume delivered
data.

I will be happy to consider all options

Thanks

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 08:46, Sean Owen  wrote:

> Why Hive and why precompute data at 15 minute latency? there are
> several ways here to query the source data directly with no extra step
> or latency here. Even Spark SQL is real-time-ish for queries on the
> source data, and Impala (or heck Drill etc) are.
>
> On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
>  wrote:
> > OK this seems to be working for the "Batch layer". I will try to create a
> > functional diagram for it
> >
> > Publisher sends prices every two seconds
> > Kafka receives data
> > Flume delivers data from Kafka to HDFS on text files time stamped
> > A Hive ORC external table (source table) is created on the directory
> where
> > flume writes continuously
> > All temporary flume tables are prefixed by "." (hidden files), so Hive
> > external table does not see those
> > Every price row includes a timestamp
> > A conventional Hive table (target table) is created with all columns from
> > the external table + two additional columns with one being a timestamp
> from
> > Hive
> > A cron job is set up that runs every 15 minutes, as below
> > 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh
> -D
> > test > /var/tmp/populate_marketData_test.err 2>&1)
> >
> > This cron, as can be seen, runs every 15 minutes and refreshes the
> Hive
> > target table with the new data. New data meaning the price created time >
> > MAX(price created time) from the target table
> >
> > Target table statistics are updated at each run. It takes an average of 2
> > minutes to run the job
> > Thu Sep 15 22:45:01 BST 2016  === Started
> > /home/hduser/dba/bin/populate_marketData.ksh  ===
> > 15/09/2016 22:45:09.09
> > 15/09/2016 22:46:57.57
> > 2016-09-15T22:46:10
> > 2016-09-15T22:46:57
> > Thu Sep 15 22:47:21 BST 2016  === Completed
> > /home/hduser/dba/bin/populate_marketData.ksh  ===
> >
> >
> > So the target table is 15 minutes out of sync with flume data which is
> not
> > bad.
> >
> > Assuming that I replace ORC tables with Parquet, druid whatever, that
> can be
> > done pretty easily. However, although I am using Zeppelin here, people
> may
> > decide to use Tableau, QlikView etc which we need to think about the
> > connectivity between these notebooks and the underlying database. I know
> > Tableau and it is very SQL centric and works with ODBC and JDBC drivers
> or
> > native drivers. For example I know that Tableau comes with Hive supplied
> > ODBC drivers. I am not sure these database have drivers for Druid etc?
> >
> > Let me know your thoughts.
> >
> > Cheers
> >
> > Dr Mich Talebzadeh
> >
>


Re: Spark SQL - Applying transformation on a struct inside an array

2016-09-16 Thread Olivier Girardot
Hi Michael,
Well, for nested structs, I saw in the tests the behaviour defined by
SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal
for me, I managed to make it work anyway like that:
> df.withColumn("a", struct(struct(myUDF(df("a.b.c." // I didn't put back the aliases but you see
what I mean.
What I'd like to make work, in essence, is something like that:
> val someFunc : String => String = ???
> val myUDF = udf(someFunc)
> df.withColumn("a.b[*].c", myUDF(df("a.b[*].c")))
The fact is that, in order to be consistent with the previous API, maybe I'd have to put
something like a struct(array(struct(… which would be troublesome because I'd have to
parse the arbitrary input string and create something like "a.b[*].c" => struct(array(struct(
I realise the ambiguity implied in this kind of column expression, but it doesn't
seem possible for now to cleanly update data in place at an arbitrary depth.
I'll try to work on a PR that would make this possible, but any pointers would
be appreciated.
Regards,
Olivier.
 





On Fri, Sep 16, 2016 12:42 AM, Michael Armbrust mich...@databricks.com
wrote:
Is what you are looking for a withColumn that supports in-place modification of
nested columns? Or is it some other problem?
On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
I tried to use the RowEncoder but got stuck along the way. The main issue really
is that even if it's possible (however tedious) to pattern match generically
Row(s) and target the nested field that you need to modify, Rows being immutable
data structures without a method like a case class's copy or any kind of lens to
create a brand new object, I ended up stuck at the step "target and extract the
field to update" without any way to update the original Row with the new value.
To sum up, I tried:
 * using only the dataframe's API itself + my udf - which works
   for nested structs as long as no arrays are along the way
 * trying to create a udf that can apply on Row and pattern
   match recursively the path I needed to explore/modify
 * trying to create a UDT - but we seem to be stuck in a
   strange middle-ground with 2.0 because some parts of the API ended up private
   while some stayed public, making it impossible to use it now (I'd be glad if
   I'm mistaken)

All of these failed for me and I ended up converting the rows to JSON and updating
using JSONPath, which is… something I'd like to avoid 'pretty please'





On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com
wrote:
Hi Guys,
Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a
public API, but it is publicly accessible. I used it recently to correct some
bad data in a few nested columns in a dataframe. It wasn't an easy job, but it
made it possible. In my particular case I was not working with arrays.
Olivier, I'm interested in seeing what you come up with.
Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss  wrote:
+1 to this request. I talked last week with a product group within IBM that is
struggling with the same issue. It's pretty common in data cleaning applications
for data in the early stages to have nested lists or sets with inconsistent or
incomplete schema information.
Fred
On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com>  wrote:
Hi everyone,
I'm currently trying to create a generic transformation mechanism on
a Dataframe to modify an arbitrary column regardless of the underlying
schema.
It's "relatively" straightforward for complex types like struct> to
apply an arbitrary UDF on the column and replace the data "inside" the struct;
however I'm struggling to make it work for complex types containing arrays along
the way, like struct>>.
Michael Armbrust seemed to allude on the mailing list/forum to a way of using
Encoders to do that. I'd be interested in any pointers, especially considering
that it's not possible to output any Row or GenericRowWithSchema from a UDF
(thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/
src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657  it
seems).
To sum up, I'd like to find a way to apply a transformation on complex nested
datatypes (arrays and struct) on a Dataframe, updating the value itself.
Regards,
Olivier Girardot

 



Olivier Girardot| Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
 



RE: Spark processing Multiple Streams from a single stream

2016-09-16 Thread ayan guha
Rdd no. File yes, using fileStream. But filestream does not support replay,
I think. You need to manage checkpoint yourself.
On 16 Sep 2016 16:56, "Udbhav Agarwal"  wrote:

> That sounds great. Thanks.
>
> Can I assume that source for a stream in spark can only be some external
> source like kafka etc.? Source cannot be some rdd in spark or some external
> file ?
>
>
>
> Thanks,
>
> Udbhav
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Friday, September 16, 2016 3:01 AM
> *To:* Udbhav Agarwal 
> *Cc:* user 
> *Subject:* RE: Spark processing Multiple Streams from a single stream
>
>
>
> You may consider writing back to Kafka from main stream and then have
> downstream consumers.
> This will keep things modular and independent.
>
> On 15 Sep 2016 23:29, "Udbhav Agarwal"  wrote:
>
> Thank you Ayan for a reply.
>
> Source is kafka but I am reading from this source in my main stream. I
> will perform some operations here. Then I want to send the output of these
> operation to 4 parallel tasks. For these 4 parallel tasks I want 4 new
> streams. Is such an implementation possible here ?
>
>
>
> Thanks,
>
> Udbhav
>
> *From:* ayan guha [mailto:guha.a...@gmail.com]
> *Sent:* Thursday, September 15, 2016 6:43 PM
> *To:* Udbhav Agarwal 
> *Cc:* user 
> *Subject:* Re: Spark processing Multiple Streams from a single stream
>
>
>
> Depending on source. For example, if source is Kafka then you can write 4
> streaming consumers.
>
> On 15 Sep 2016 20:11, "Udbhav Agarwal"  wrote:
>
> Hi All,
>
> I have a scenario where I want to process a message in various ways in
> parallel. For instance a message is coming inside spark stream(DStream) and
> I want to send this message to 4 different tasks in parallel. I want these
> 4 different tasks to be separate streams in the original spark stream and
> are always active and waiting for input. Can I implement such a process
> with spark streaming ? How ?
>
> Thanks in advance.
>
>
>
> *Thanks,*
>
> *Udbhav Agarwal*
>
>
>
>
>
>


Re: Best way to present data collected by Flume through Spark

2016-09-16 Thread Sean Owen
Why Hive, and why precompute data at 15-minute latency? There are
several ways here to query the source data directly with no extra step
or latency. Even Spark SQL is real-time-ish for queries on the
source data, and Impala (or heck, Drill etc) are too.

On Thu, Sep 15, 2016 at 10:56 PM, Mich Talebzadeh
 wrote:
> OK this seems to be working for the "Batch layer". I will try to create a
> functional diagram for it
>
> Publisher sends prices every two seconds
> Kafka receives data
> Flume delivers data from Kafka to HDFS on text files time stamped
> A Hive ORC external table (source table) is created on the directory where
> flume writes continuously
> All temporary flume tables are prefixed by "." (hidden files), so Hive
> external table does not see those
> Every price row includes a timestamp
> A conventional Hive table (target table) is created with all columns from
> the external table + two additional columns with one being a timestamp from
> Hive
> A cron job is set up that runs every 15 minutes, as below
> 0,15,30,45 00-23 * * 1-5 (/home/hduser/dba/bin/populate_marketData.ksh -D
> test > /var/tmp/populate_marketData_test.err 2>&1)
>
> This cron, as can be seen, runs every 15 minutes and refreshes the Hive
> target table with the new data. New data meaning the price created time >
> MAX(price created time) from the target table
>
> Target table statistics are updated at each run. It takes an average of 2
> minutes to run the job
> Thu Sep 15 22:45:01 BST 2016  === Started
> /home/hduser/dba/bin/populate_marketData.ksh  ===
> 15/09/2016 22:45:09.09
> 15/09/2016 22:46:57.57
> 2016-09-15T22:46:10
> 2016-09-15T22:46:57
> Thu Sep 15 22:47:21 BST 2016  === Completed
> /home/hduser/dba/bin/populate_marketData.ksh  ===
>
>
> So the target table is 15 minutes out of sync with flume data which is not
> bad.
>
> Assuming that I replace ORC tables with Parquet, druid whatever, that can be
> done pretty easily. However, although I am using Zeppelin here, people may
> decide to use Tableau, QlikView etc which we need to think about the
> connectivity between these notebooks and the underlying database. I know
> Tableau and it is very SQL centric and works with ODBC and JDBC drivers or
> native drivers. For example I know that Tableau comes with Hive supplied
> ODBC drivers. I am not sure these database have drivers for Druid etc?
>
> Let me know your thoughts.
>
> Cheers
>
> Dr Mich Talebzadeh
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: countApprox

2016-09-16 Thread Sean Owen
countApprox gives the best answer within some timeout. Is it possible
that 1ms is more than enough to count this exactly? then the
confidence wouldn't matter. Although that seems way too fast, you're
counting ranges whose values don't actually matter, and maybe the
Python side is smart enough to use that fact. Then counting a
partition takes almost no time. Does it return immediately?

On Thu, Sep 15, 2016 at 6:20 PM, Stefano Lodi  wrote:
> I am experimenting with countApprox. I created a RDD of 10^8 numbers and ran
> countApprox with different parameters but I failed to generate any
> approximate output. In all runs it returns the exact number of elements.
> What is the effect of approximation in countApprox supposed to be, and for
> what inputs and parameters?
>
 rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)],
 50)
 rdd.countApprox(1, 0.8)
> [Stage 12:>(0 + 0) /
> 50]16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 12:==> (49 + 1) /
> 50]1
 rdd.countApprox(1, 0.01)
> 16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very
> large size (5402 KB). The maximum recommended task size is 100 KB.
> [Stage 13:>   (47 + 3) /
> 50]1
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: very slow parquet file write

2016-09-16 Thread tosaigan...@gmail.com
Hi,

try this conf


val sc = new SparkContext(conf)
sc.hadoopConfiguration.setBoolean("parquet.enable.summary-metadata", false)


Regards,
Sai Ganesh

On Thu, Sep 15, 2016 at 11:34 PM, gaurav24 [via Apache Spark User List] <
ml-node+s1001560n27738...@n3.nabble.com> wrote:

> Hi Rok,
>
> facing similar issue with streaming where I append to parquet data every
> hour. Writing seems to be slowing down it time it writes. It has gone from
> 17 mins to 40 mins in a month
>
>




-
Sai Ganesh
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/very-slow-parquet-file-write-tp25295p27739.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Mich Talebzadeh
Is your Hive Thrift Server up and running on port
jdbc:hive2://10001?

Do  the following

 netstat -alnp |grep 10001

and see whether it is actually running

HTH





Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 16 September 2016 at 19:53,  wrote:

> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> java.sql.SQLException: Could not open connection to
> jdbc:hive2://10001/default;principal=hive/ server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConnection.(
> HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.
> java:105)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:571)
>
>at java.sql.DriverManager.getConnection(DriverManager.
> java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.JavaPairRDD$$anonfun$
> toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at scala.collection.Iterator$$anon$11.next(Iterator.scala:
> 328)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.
> apply(PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.
> scala:1285)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
>
>at org.apache.spark.rdd.PairRDDFunctions$$anonfun$
> saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
>
>at org.apache.spark.scheduler.
> ResultTask.runTask(ResultTask.scala:63)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
>at org.apache.spark.executor.Executor$TaskRunner.run(
> Executor.scala:213)
>
>at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1145)
>
>at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.thrift.transport.TTransportException
>
>at org.apache.thrift.transport.TIOStreamTransport.read(
> TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.
> TTransport.readAll(TTransport.java:84)
>
>at org.apache.thrift.transport.TSaslTransport.
> receiveSaslMessage(TSaslTransport.java:182)
>
>at org.apache.thrift.transport.TSaslTransport.open(
> TSaslTransport.java:258)
>
>at org.apache.thrift.transport.TSaslClientTransport.open(
> TSaslClientTransport.java:37)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:415)
>
>at org.apache.hadoop.security.UserGroupInformation.doAs(
> UserGroupInformation.java:1657)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
>at org.apache.hive.jdbc.HiveConnection.openTransport(
> HiveConnection.java:203)
>
>... 21 more
>
>
>
> In spark conf directory hive-site.xml has the following properties
>
>
>
> 
>
>
>
> 
>
>   

Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Advait Mohan Raut
Hi

I am trying to run Spark 2.0.0 in the Microsoft Windows environment without 
hadoop or hive. I am running it in the local mode i.e. cmd> spark-shell and can 
run the shell. When I try to run the sample example 
here
 provided in the documentation, the spark execution fails to create the data 
frame at .toDF()

// Create an RDD of Person objects from a text file, convert it to a DataFrame
val peopleDF = spark.sparkContext
  .textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(attributes => Person(attributes(0), attributes(1).trim.toInt))
  .toDF()


I have also run and tried these steps separately (split and Person) but it 
fails at .toDF()

It give the sequence of exceptions like:

HiveException: java.lang.RuntimeException:
Unable to instantiate org...SessionHiveMetaStoreClient
..
..
IllegalArgumentException: java.net.URISyntaxException: Relative path in the 
absolute URI


I have also defined the case class Person(name: String, age: Int)

I had successfully run codes for spark 1.4 on Windows. Is there any guide 
available to configure spark 2.0.0 on Windows for above mentioned environment ? 
Or it does not support ? Your inputs will be appreciated.

Source: 
http://stackoverflow.com/questions/39538544/apache-spark-2-0-0-on-microsoft-windows-create-dataframe
Similar Post: 
http://stackoverflow.com/questions/39402145/spark-windows-error-when-executing-todf



Regards
Advait Mohan Raut
Essex Lake Group LLC |
Mobile: +91-99-101-700-69 |
Email: adv...@essexlg.com |
Skype: advait.raut








The information transmitted herewith is sensitive information intended only for 
use to the individual or entity to which it is addressed. If the reader of this 
message is not the intended recipient, you are hereby notified that any review, 
retransmission, dissemination, distribution, copying or other use of, or taking 
of any action in reliance upon, this information is strictly prohibited. If you 
have received this communication in error, please contact the sender and delete 
the material from your computer.

WARNING: E-mail communications cannot be guaranteed to be timely, secure, 
error-free or virus-free. The recipient of this communication should check this 
e-mail and each attachment for the presence of viruses. The sender does not 
accept any liability for any errors or omissions in the content of this 
electronic communication which arises as a result of e-mail transmission.

How PolynomialExpansion works

2016-09-16 Thread Nirav Patel
Doc says:

Take a 2-variable feature vector as an example: (x, y), if we want to
expand it with degree 2, then we get (x, x * x, y, x * y, y * y).

I know polynomial expansion of (x+y)^2 = x^2 + 2xy + y^2 but can't relate
it to above.

Thanks

-- 





Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread anupama . gangadhar
Hi,

I am trying to connect to Hive from a Spark application in a Kerberized cluster and 
get the following exception. Spark version is 1.4.1 and Hive is 1.2.1. Outside 
of Spark the connection goes through fine.
Am I missing any configuration parameters?

java.sql.SQLException: Could not open connection to jdbc:hive2://10001/default;principal=hive/;ssl=false;transportMode=http;httpPath=cliservice: null
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:206)
   at 
org.apache.hive.jdbc.HiveConnection.(HiveConnection.java:178)
   at org.apache.hive.jdbc.HiveDriver.connect(HiveDriver.java:105)
   at java.sql.DriverManager.getConnection(DriverManager.java:571)
   at java.sql.DriverManager.getConnection(DriverManager.java:215)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
   at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
   at 
org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
   at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
   at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
   at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.thrift.transport.TTransportException
   at 
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
   at 
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
   at 
org.apache.thrift.transport.TSaslTransport.receiveSaslMessage(TSaslTransport.java:182)
   at 
org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:258)
   at 
org.apache.thrift.transport.TSaslClientTransport.open(TSaslClientTransport.java:37)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:415)
   at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
   at 
org.apache.hadoop.hive.thrift.client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
   at 
org.apache.hive.jdbc.HiveConnection.openTransport(HiveConnection.java:203)
   ... 21 more

In spark conf directory hive-site.xml has the following properties

<configuration>

  <property>
    <name>hive.metastore.kerberos.keytab.file</name>
    <value>/etc/security/keytabs/hive.service.keytab</value>
  </property>

  <property>
    <name>hive.metastore.kerberos.principal</name>
    <value>hive/_HOST@</value>
  </property>

  <property>
    <name>hive.metastore.sasl.enabled</name>
    <value>true</value>
  </property>

  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://:9083</value>
  </property>

  <property>
    <name>hive.server2.authentication</name>
    <value>KERBEROS</value>
  </property>

  <property>
    <name>hive.server2.authentication.kerberos.keytab</name>
    <value>/etc/security/keytabs/hive.service.keytab</value>
  </property>

  <property>
    <name>hive.server2.authentication.kerberos.principal</name>
    <value>hive/_HOST@</value>
  </property>

  <property>
    <name>hive.server2.authentication.spnego.keytab</name>
    <value>/etc/security/keytabs/spnego.service.keytab</value>
  </property>

  <property>
    <name>hive.server2.authentication.spnego.principal</name>
    <value>HTTP/_HOST@</value>
  </property>

</configuration>

--Thank you

If you are not the addressee, please inform us immediately that you have 
received this e-mail by mistake, and delete it. We thank you for your support.



Re: How PolynomialExpansion works

2016-09-16 Thread Sean Owen
The result includes, essentially, all the terms in (x+y) and (x+y)^2,
and so on up if you chose a higher power. It is not just the
second-degree terms.
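
A small check of the documented ordering for a 2-element vector with degree 2
(column names are illustrative):

import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(Tuple1(Vectors.dense(2.0, 3.0)))).toDF("features")

val expanded = new PolynomialExpansion()
  .setInputCol("features")
  .setOutputCol("poly")
  .setDegree(2)
  .transform(df)

// expected per the documented ordering x, x*x, y, x*y, y*y: [2.0, 4.0, 3.0, 6.0, 9.0]
expanded.select("poly").show(false)
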

On Fri, Sep 16, 2016 at 7:43 PM, Nirav Patel  wrote:
> Doc says:
>
> Take a 2-variable feature vector as an example: (x, y), if we want to expand
> it with degree 2, then we get (x, x * x, y, x * y, y * y).
>
> I know polynomial expansion of (x+y)^2 = x^2 + 2xy + y^2 but can't relate it
> to above.
>
> Thanks
>
>
>
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Apache Spark 2.0.0 on Microsoft Windows Create Dataframe

2016-09-16 Thread Jacek Laskowski
Hi Advait,

It's due to https://issues.apache.org/jira/browse/SPARK-15565.

See http://stackoverflow.com/a/38945867/1305344 for a solution (that's
spark.sql.warehouse.dir away). Upvote if it works for you. Thanks!
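
For reference, the workaround boils down to pointing spark.sql.warehouse.dir at
a well-formed local URI before the first SparkSession is created (the path is
an illustrative assumption):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("windows-todf")
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()

import spark.implicits._   // needed for .toDF()

In spark-shell the same setting can be passed as
--conf spark.sql.warehouse.dir=file:///C:/tmp/spark-warehouse.
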

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski


On Fri, Sep 16, 2016 at 9:47 PM, Advait Mohan Raut  wrote:
> Hi
>
> I am trying to run Spark 2.0.0 in the Microsoft Windows environment without
> hadoop or hive. I am running it in the local mode i.e. cmd> spark-shell and
> can run the shell. When I try to run the sample example here provided in the
> documentation, the spark execution fails to create the data frame at .toDF()
>
> // Create an RDD of Person objects from a text file, convert it to a
> Dataframe
> val peopleDF = spark.sparkContext
> ..textFile("examples/src/main/resources/people.txt")
> ..map(_.split(","))
> ..map(attributes => Person(attributes(0), attributes(1).trim.toInt))
> ..toDF()
>
> I have also run and tried these steps separately (split and Person) but it
> fails at .toDF()
>
> It give the sequence of exceptions like:
>
> HiveException: java.lang.RuntimeException:
> Unable to instantiate org...SessionHiveMetaStoreClient
> ..
> ..
> IllegalArgumentException: java.net.URISyntaxException: Relative path in the
> absolute URI
>
> I have also defined the case class Person(age:String, age:Int)
>
> I had successfully run codes for spark 1.4 on Windows. Is there any guide
> available to configure spark 2.0.0 on Windows for above mentioned
> environment ? Or it does not support ? Your inputs will be appreciated.
>
>
> Source:
> http://stackoverflow.com/questions/39538544/apache-spark-2-0-0-on-microsoft-windows-create-dataframe
> Similar Post:
> http://stackoverflow.com/questions/39402145/spark-windows-error-when-executing-todf
>
>
>
> Regards
> Advait Mohan Raut
> Essex Lake Group LLC |
> Mobile: +91-99-101-700-69 |
> Email: adv...@essexlg.com |
> Skype: advait.raut
>
>
>
>
>
>
> The information transmitted herewith is sensitive information intended only
> for use to the individual or entity to which it is addressed. If the reader
> of this message is not the intended recipient, you are hereby notified that
> any review, retransmission, dissemination, distribution, copying or other
> use of, or taking of any action in reliance upon, this information is
> strictly prohibited. If you have received this communication in error,
> please contact the sender and delete the material from your computer.
>
>
> WARNING: E-mail communications cannot be guaranteed to be timely, secure,
> error-free or virus-free. The recipient of this communication should check
> this e-mail and each attachment for the presence of viruses. The sender does
> not accept any liability for any errors or omissions in the content of this
> electronic communication which arises as a result of e-mail transmission.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: JDBC Very Slow

2016-09-16 Thread Nikolay Zhebet
Hi! Can you split the init code from the current command? I think it is the main problem
in your code.
On 16 Sep 2016 8:26 PM, "Benjamin Kim" wrote:

> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
> data from PostgreSQL using JDBC? I can get to the table and see the schema,
> but when I do a show, it takes very long or keeps timing out.
>
> The code is simple.
>
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>"dbtable" -> "schema.table")).load()
>
> jdbcDF.show
>
>
> If anyone can help, please let me know.
>
> Thanks,
> Ben
>
>


Re: Spark output data to S3 is very slow

2016-09-16 Thread Takeshi Yamamuro
Hi,

Have you seen the previous thread?
https://www.mail-archive.com/user@spark.apache.org/msg56791.html

// maropu


On Sat, Sep 17, 2016 at 11:34 AM, Qiang Li  wrote:

> Hi,
>
>
> I ran some jobs with Spark 2.0 on Yarn, I found all tasks finished very
> quickly, but the last step, spark spend lots of time to rename or move data
> from s3 temporary directory to real directory, then I try to set
>
> spark.hadoop.spark.sql.parquet.output.committer.
> class=org.apache.spark.sql.execution.datasources.parquet.
> DirectParquetOutputCommitter
> or
> spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.
> DirectParquetOutputCommitter
>
> But both doesn't work, looks like spark 2.0 removed these configs, how can
> I let spark output directly without temporary directory ?
>
>
>
> *This email may contain or reference confidential information and is
> intended only for the individual to whom it is addressed.  Please refrain
> from distributing, disclosing or copying this email and the information
> contained within unless you are the intended recipient.  If you received
> this email in error, please notify us at le...@appannie.com
> ** immediately and remove it from your system.*




-- 
---
Takeshi Yamamuro


Re: Error trying to connect to Hive from Spark (Yarn-Cluster Mode)

2016-09-16 Thread Deepak Sharma
Hi Anupama

To me it looks like an issue with the SPN with which you are trying to connect
to hive2, i.e. hive@hostname.

Are you able to connect to hive from spark-shell?

Try getting the ticket using any other user keytab (not the hadoop services
keytab) and then try running the spark-submit.


Thanks

Deepak

On 17 Sep 2016 12:23 am,  wrote:

> Hi,
>
>
>
> I am trying to connect to Hive from Spark application in Kerborized
> cluster and get the following exception.  Spark version is 1.4.1 and Hive
> is 1.2.1. Outside of spark the connection goes through fine.
>
> Am I missing any configuration parameters?
>
>
>
> ava.sql.SQLException: Could not open connection to
> jdbc:hive2://10001/default;principal=hive/ server2 host>;ssl=false;transportMode=http;httpPath=cliservice: null
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:206)
>
>at org.apache.hive.jdbc.HiveConne
> ction.(HiveConnection.java:178)
>
>at org.apache.hive.jdbc.HiveDrive
> r.connect(HiveDriver.java:105)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:571)
>
>at java.sql.DriverManager.getConn
> ection(DriverManager.java:215)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:124)
>
>at SparkHiveJDBCTest$1.call(SparkHiveJDBCTest.java:1)
>
>at org.apache.spark.api.java.Java
> PairRDD$$anonfun$toScalaFunction$1.apply(JavaPairRDD.scala:1027)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at scala.collection.Iterator$$ano
> n$11.next(Iterator.scala:328)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$
> apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(
> PairRDDFunctions.scala:1108)
>
>at org.apache.spark.util.Utils$.t
> ryWithSafeFinally(Utils.scala:1285)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1116)
>
>at org.apache.spark.rdd.PairRDDFu
> nctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(Pai
> rRDDFunctions.scala:1095)
>
>at org.apache.spark.scheduler.Res
> ultTask.runTask(ResultTask.scala:63)
>
>at org.apache.spark.scheduler.Task.run(Task.scala:70)
>
>at org.apache.spark.executor.Exec
> utor$TaskRunner.run(Executor.scala:213)
>
>at java.util.concurrent.ThreadPoo
> lExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>at java.util.concurrent.ThreadPoo
> lExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
>at java.lang.Thread.run(Thread.java:745)
>
> Caused by: org.apache.thrift.transport.TTransportException
>
>at org.apache.thrift.transport.TI
> OStreamTransport.read(TIOStreamTransport.java:132)
>
>at org.apache.thrift.transport.TT
> ransport.readAll(TTransport.java:84)
>
>at org.apache.thrift.transport.TS
> aslTransport.receiveSaslMessage(TSaslTransport.java:182)
>
>at org.apache.thrift.transport.TS
> aslTransport.open(TSaslTransport.java:258)
>
>at org.apache.thrift.transport.TS
> aslClientTransport.open(TSaslClientTransport.java:37)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:52)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport$1.run(TUGIAssumingTransport.java:49)
>
>at java.security.AccessController.doPrivileged(Native
> Method)
>
>at javax.security.auth.Subject.doAs(Subject.java:415)
>
>at org.apache.hadoop.security.Use
> rGroupInformation.doAs(UserGroupInformation.java:1657)
>
>at org.apache.hadoop.hive.thrift.
> client.TUGIAssumingTransport.open(TUGIAssumingTransport.java:49)
>
>at org.apache.hive.jdbc.HiveConne
> ction.openTransport(HiveConnection.java:203)
>
>... 21 more
>
>
>
> In spark conf directory hive-site.xml has the following properties
>
>
>
> 
>
>
>
> 
>
>   hive.metastore.kerberos.keytab.file
>
>   /etc/security/keytabs/hive.service.keytab
>
> 
>
>
>
> 
>
>   hive.metastore.kerberos.principal
>
>   hive/_HOST@
>
> 
>
>
>
> 
>
>   hive.metastore.sasl.enabled
>
>   true
>
> 
>
>
>
> 
>
>   hive.metastore.uris
>
>   thrift://:9083
>
> 
>
>
>
> 
>
>   hive.server2.authentication
>
>   KERBEROS
>
> 
>
>
>
> 
>
>   

Re: spark streaming kafka connector questions

2016-09-16 Thread 毅程
Thanks, that is what I was missing. I have added cache before the action, and
that 2nd processing is avoided.
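
For reference, the fix boils down to caching the shared upstream DStream so the
filter/union branches do not recompute it (names below are illustrative, not
from the original job):

// `parsed` stands in for the output of process (1)
parsed.cache()                                    // or persist(...) with an explicit storage level

val streamA = parsed.filter(_.contains("A"))      // branch kept for the later union
val streamB = parsed.filter(_.contains("B")).map(_.toLowerCase)   // process (2)
streamA.union(streamB).foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))  // process (3)
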

2016-09-10 5:10 GMT-07:00 Cody Koeninger :

> Hard to say without seeing the code, but if you do multiple actions on an
> Rdd without caching, the Rdd will be computed multiple times.
>
> On Sep 10, 2016 2:43 AM, "Cheng Yi"  wrote:
>
> After some investigation, the problem I see is likely caused by a filter and
> union of the dstream.
> if i just do kafka-stream -- process -- output operator, then there is no
> problem, one event will be fetched once.
> if i do
> kafka-stream -- process(1) - filter a stream A for later union --|
>|_ filter a stream B  -- process(2)
> -|_ A union B output process (3)
> the event will be fetched 2 times, duplicate message start process at the
> end of process(1), see following traces:
>
> 16/09/10 00:11:00 INFO CachedKafkaConsumer: Initial fetch for
> spark-executor-testgid log-analysis-topic 2 1 *(fetch EVENT 1st time)*
>
> 16/09/10 00:11:00 INFO AbstractCoordinator: Discovered coordinator
> 192.168.2.6:9092 (id: 2147483647 rack: null) for group
> spark-executor-testgid.
>
> log of processing (1) for event 1
>
> 16/09/10 00:11:03 INFO Executor: Finished task 0.0 in stage 9.0 (TID 36).
> 1401 bytes result sent to driver
>
> 16/09/10 00:11:03 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID
> 36) in 3494 ms on localhost (3/3)
>
> 16/09/10 00:11:03 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks
> have all completed, from pool
>
> 16/09/10 00:11:03 INFO DAGScheduler: ShuffleMapStage 9 (flatMapToPair
> (*processing (1)*) at SparkAppDriver.java:136) finished in 3.506 s
>
> 16/09/10 00:11:03 INFO DAGScheduler: looking for newly runnable stages
>
> 16/09/10 00:11:03 INFO DAGScheduler: running: Set()
>
> 16/09/10 00:11:03 INFO DAGScheduler: waiting: Set(ShuffleMapStage 10,
> ResultStage 11)
>
> 16/09/10 00:11:03 INFO DAGScheduler: failed: Set()
>
> 16/09/10 00:11:03 INFO DAGScheduler: Submitting ShuffleMapStage 10
> (UnionRDD[41] at union (*process (3)*) at SparkAppDriver.java:155), which
> has no missing parents
>
> 16/09/10 00:11:03 INFO DAGScheduler: Submitting 6 missing tasks from
> ShuffleMapStage 10 (UnionRDD[41] at union at SparkAppDriver.java:155)
>
> 16/09/10 00:11:03 INFO KafkaRDD: Computing topic log-analysis-topic,
> partition 2 offsets 1 -> 2
>
> 16/09/10 00:11:03 INFO CachedKafkaConsumer: Initial fetch for
> spark-executor-testgid log-analysis-topic 2 1 ( *(fetch the same EVENT 2nd
> time)*)
>
> 16/09/10 00:11:10 INFO JobScheduler: Finished job streaming job
> 147349146 ms.0 from job set of time 147349146 ms
>
> 16/09/10 00:11:10 INFO JobScheduler: Total delay: 10.920 s for time
> 147349146 ms (execution: 10.874 s)* (EVENT 1st time process cost 10.874
> s)*
>
> 16/09/10 00:11:10 INFO JobScheduler: Finished job streaming job
> 1473491465000 ms.0 from job set of time 1473491465000 ms
>
> 16/09/10 00:11:10 INFO JobScheduler: Total delay: 5.986 s for time
> 1473491465000 ms (execution: 0.066 s) *(EVENT 2nd time process cost 0.066)*
>
> and the 2nd time processing of the event finished without really doing the
> work.
>
> Help is hugely appreciated.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/spark-streaming-kafka-connector-questi
> ons-tp27681p27687.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: JDBC Very Slow

2016-09-16 Thread Takeshi Yamamuro
Hi,

It'd be better to set `predicates` in jdbc arguments for loading in
parallel.
See:
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L200
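
A sketch of a partitioned read via predicates (the predicate column and
connection details are illustrative assumptions):

import java.util.Properties

val props = new Properties()
props.setProperty("user", "user")
props.setProperty("password", "password")

val predicates = Array(
  "created_date <  '2016-01-01'",
  "created_date >= '2016-01-01'")

val jdbcDF = sqlContext.read.jdbc(
  "jdbc:postgresql://dbserver:port/database",
  "schema.table",
  predicates,        // one partition per predicate, fetched in parallel
  props)
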

// maropu

On Sat, Sep 17, 2016 at 7:46 AM, Benjamin Kim  wrote:

> I am testing this in spark-shell. I am following the Spark documentation
> by simply adding the PostgreSQL driver to the Spark Classpath.
>
> SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell
>
>
> Then, I run the code below to connect to the PostgreSQL database to query.
> This is when I have problems.
>
> Thanks,
> Ben
>
>
> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
>
> Hi! Can you split init code with current comand? I thing it is main
> problem in your code.
On 16 Sep 2016 8:26 PM, "Benjamin Kim" wrote:
>
>> Has anyone using Spark 1.6.2 encountered very slow responses from pulling
>> data from PostgreSQL using JDBC? I can get to the table and see the schema,
>> but when I do a show, it takes very long or keeps timing out.
>>
>> The code is simple.
>>
>> val jdbcDF = sqlContext.read.format("jdbc").options(
>> Map("url" -> "jdbc:postgresql://dbserver:po
>> rt/database?user=user=password",
>>"dbtable" -> “schema.table")).load()
>>
>> jdbcDF.show
>>
>>
>> If anyone can help, please let me know.
>>
>> Thanks,
>> Ben
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: JDBC Very Slow

2016-09-16 Thread Benjamin Kim
I am testing this in spark-shell. I am following the Spark documentation by 
simply adding the PostgreSQL driver to the Spark Classpath.

SPARK_CLASSPATH=/path/to/postgresql/driver spark-shell

Then, I run the code below to connect to the PostgreSQL database to query. This 
is when I have problems.

Thanks,
Ben


> On Sep 16, 2016, at 3:29 PM, Nikolay Zhebet  wrote:
> 
> Hi! Can you split init code with current comand? I thing it is main problem 
> in your code.
> 
> On 16 Sep 2016 8:26 PM, "Benjamin Kim" wrote:
> Has anyone using Spark 1.6.2 encountered very slow responses from pulling 
> data from PostgreSQL using JDBC? I can get to the table and see the schema, 
> but when I do a show, it takes very long or keeps timing out.
> 
> The code is simple.
> 
> val jdbcDF = sqlContext.read.format("jdbc").options(
> Map("url" -> "jdbc:postgresql://dbserver:port/database?user=user&password=password",
>"dbtable" -> "schema.table")).load()
> 
> jdbcDF.show
> 
> If anyone can help, please let me know.
> 
> Thanks,
> Ben
> 



feasibility of ignite and alluxio for interfacing MPI and Spark

2016-09-16 Thread AlexG
Do Ignite and Alluxio offer reasonable means of transferring data, in memory,
from Spark to MPI? A straightforward way to transfer data is to use piping, but
unless you have MPI processes running in a one-to-one mapping to the Spark
partitions, this will require some complicated logic to get working (you'll
have to handle multiple tasks sending their data to one process). 
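
For what it's worth, a minimal sketch of the piping approach in a spark-shell session (the MPI-side executable path below is purely hypothetical):

import org.apache.spark.rdd.RDD

val data: RDD[String] = sc.textFile("hdfs:///path/to/input")
// each partition's records are written line-by-line to the external process's stdin,
// and whatever the process writes to stdout becomes the resulting RDD
val piped: RDD[String] = data.pipe("/opt/mpi/consume_partition")  // hypothetical binary
piped.count()  // forces the pipe to actually run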

It seems that Ignite and Alluxio might allow you to pull the
data you want into each of your MPI processes without worrying about such a
requirement, but it's not clear to me from the high-level descriptions of
the systems whether this is something that can be readily realized. Is this
the case?

Another issue is that with the piping solution, you only need to store two
copies of the data: one each on the Spark and MPI sides. With Ignite and
Alluxio, would you need three? It seems that they let you replace the
standard RDDs with RDDs backed with their memory stores, but do those
perform as efficiently as the standard Spark RDDs that are persisted in
memory?

More generally, I'd be interested to know if there are existing solutions to
this problem of transferring data between MPI and Spark. Thanks for any
insight you can offer!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/feasibility-of-ignite-and-alluxio-for-interfacing-MPI-and-Spark-tp27745.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark output data to S3 is very slow

2016-09-16 Thread Qiang Li
Hi,


I ran some jobs with Spark 2.0 on YARN. I found all tasks finished very
quickly, but in the last step Spark spends lots of time renaming or moving data
from the S3 temporary directory to the real directory, so I tried to set

spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
or
spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter

But neither works; it looks like Spark 2.0 removed these configs. How can
I make Spark write output directly, without the temporary directory?

-- 
This email may contain or reference confidential information and is 
intended only for the individual to whom it is addressed.  Please refrain 
from distributing, disclosing or copying this email and the information 
contained within unless you are the intended recipient.  If you received 
this email in error, please notify us at le...@appannie.com 
immediately and remove it from your system.


Re: Spark metrics when running with YARN?

2016-09-16 Thread Vladimir Tretyakov
Hello.

I found that there is also a Spark metrics Sink, MetricsServlet,
which is enabled by default:

https://apache.googlesource.com/spark/+/refs/heads/master/core/src/main/scala/org/apache/spark/metrics/MetricsConfig.scala#40

Tried urls:

On master:
http://localhost:8080/metrics/master/json/
http://localhost:8080/metrics/applications/json

On slaves (with workers):
http://localhost:4040/metrics/json/

got information I need.
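
For reference, a quick way to poke these endpoints from a Scala REPL (assuming the default ports and a master/driver running locally):

import scala.io.Source

// MetricsServlet sink exposed on the application UI port (4040 by default)
val driverMetrics = Source.fromURL("http://localhost:4040/metrics/json/").mkString
// master metrics on the standalone master web UI port (8080 by default)
val masterMetrics = Source.fromURL("http://localhost:8080/metrics/master/json/").mkString
println(driverMetrics.take(200))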

Questions:
1. Will the master URLs work in YARN (client/cluster mode) and Mesos modes?
Or is this only for Standalone mode?
2. Will the slave URL also work for modes other than Standalone?

Why are there 2 ways to get information, REST API and this Sink?


Best regards, Vladimir.






On Mon, Sep 12, 2016 at 3:53 PM, Vladimir Tretyakov <
vladimir.tretya...@sematext.com> wrote:

> Hello Saisai Shao, Jacek Laskowski , thx for information.
>
> We are working on Spark monitoring tool and our users have different setup
> modes (Standalone, Mesos, YARN).
>
> Looked at code, found:
>
> /**
>  * Attempt to start a Jetty server bound to the supplied hostName:port using 
> the given
>  * context handlers.
>  *
>  * If the desired port number is contended, continues incrementing ports until a free
>  * port is found. Return the jetty Server object, the chosen port, and a mutable
>  * collection of handlers.
>  */
>
> It seems most generic way (which will work for most users) will be start
> looking at ports:
>
> spark.ui.port (4040 by default)
> spark.ui.port + 1
> spark.ui.port + 2
> spark.ui.port + 3
> ...
>
> Until we will get responses from Spark.
>
> PS: yeah they may be some intersections with some other applications for
> some setups, in this case we may ask users about these exceptions and do
> our housework around them.
>
> Best regards, Vladimir.
>
> On Mon, Sep 12, 2016 at 12:07 PM, Saisai Shao 
> wrote:
>
>> Here is the yarn RM REST API for you to refer (
>> http://hadoop.apache.org/docs/r2.7.0/hadoop-yarn/hadoop-
>> yarn-site/ResourceManagerRest.html). You can use these APIs to query
>> applications running on yarn.
>>
>> On Sun, Sep 11, 2016 at 11:25 PM, Jacek Laskowski 
>> wrote:
>>
>>> Hi Vladimir,
>>>
>>> You'd have to talk to your cluster manager to query for all the
>>> running Spark applications. I'm pretty sure YARN and Mesos can do that
>>> but unsure about Spark Standalone. This is certainly not something a
>>> Spark application's web UI could do for you since it is designed to
>>> handle the single Spark application.
>>>
>>> Pozdrawiam,
>>> Jacek Laskowski
>>> 
>>> https://medium.com/@jaceklaskowski/
>>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>>> Follow me at https://twitter.com/jaceklaskowski
>>>
>>>
>>> On Sun, Sep 11, 2016 at 11:18 AM, Vladimir Tretyakov
>>>  wrote:
>>> > Hello Jacek, thx a lot, it works.
>>> >
>>> > Is there a way how to get list of running applications from REST API?
>>> Or I
>>> > have to try connect 4040 4041... 40xx ports and check if ports answer
>>> > something?
>>> >
>>> > Best regards, Vladimir.
>>> >
>>> > On Sat, Sep 10, 2016 at 6:00 AM, Jacek Laskowski 
>>> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> That's correct. One app one web UI. Open 4041 and you'll see the other
>>> >> app.
>>> >>
>>> >> Jacek
>>> >>
>>> >>
>>> >> On 9 Sep 2016 11:53 a.m., "Vladimir Tretyakov"
>>> >>  wrote:
>>> >>>
>>> >>> Hello again.
>>> >>>
>>> >>> I am trying to play with Spark version "2.11-2.0.0".
>>> >>>
>>> >>> Problem that REST API and UI shows me different things.
>>> >>>
>>> >>> I've stared 2 applications from "examples set": opened 2 consoles
>>> and run
>>> >>> following command in each:
>>> >>>
>>> >>> ./bin/spark-submit   --class org.apache.spark.examples.SparkPi
>>>  --master
>>> >>> spark://wawanawna:7077  --executor-memory 2G  --total-executor-cores
>>> 30
>>> >>> examples/jars/spark-examples_2.11-2.0.0.jar  1
>>> >>>
>>> >>> Request to API endpoint:
>>> >>>
>>> >>> http://localhost:4040/api/v1/applications
>>> >>>
>>> >>> returned me following JSON:
>>> >>>
>>> >>> [ {
>>> >>>   "id" : "app-20160909184529-0016",
>>> >>>   "name" : "Spark Pi",
>>> >>>   "attempts" : [ {
>>> >>> "startTime" : "2016-09-09T15:45:25.047GMT",
>>> >>> "endTime" : "1969-12-31T23:59:59.999GMT",
>>> >>> "lastUpdated" : "2016-09-09T15:45:25.047GMT",
>>> >>> "duration" : 0,
>>> >>> "sparkUser" : "",
>>> >>> "completed" : false,
>>> >>> "startTimeEpoch" : 1473435925047,
>>> >>> "endTimeEpoch" : -1,
>>> >>> "lastUpdatedEpoch" : 1473435925047
>>> >>>   } ]
>>> >>> } ]
>>> >>>
>>> >>> so response contains information only about 1 application.
>>> >>>
>>> >>> But in reality I've started 2 applications and Spark UI shows me 2
>>> >>> RUNNING application (please see screenshot).
>>> >>>
>>> >>> Does anybody maybe know answer why API and UI shows different things?
>>> >>>
>>> >>>
>>> >>> Best 

Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Sorry, it wasn't the count, it was the reduce method that retrieves
information from the RDD.
It has to go through all the RDD values to return the result.
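
A minimal sketch of the distinction (assuming a spark-shell session where sc is already available):

// transformations such as map are lazy: nothing runs, and no breakpoint inside
// the closure is hit, until an action forces evaluation
val rdd = sc.parallelize(1 to 10).map { i =>
  i * 2  // a breakpoint here is only reached once an action runs
}
val sum = rdd.reduce(_ + _)  // reduce is an action: the map closure executes now
val n = rdd.count()          // count is another action that would also trigger it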


2016-09-16 11:18 GMT-03:00 chen yong :

> Dear Dirceu,
>
>
> I am totally confused . In your reply you mentioned ".the count does
> that, ..." .However, in the code snippet shown in  the attachment file 
> FelixProblem.png
> of your previous mail,  I cannot find any 'count' ACTION is called.  Would
> you please clearly show me the line it is which triggeres the evaluation.
>
> Thanks you very much
> --
> *发件人:* Dirceu Semighini Filho 
> *发送时间:* 2016年9月16日 21:07
> *收件人:* chen yong
> *抄送:* user@spark.apache.org
> *主题:* Re: 答复: 答复: 答复: 答复: t it does not stop at breakpoints which is in
> an anonymous function
>
> Hello Felix,
> No, this line isn't the one that is triggering the execution of the
> function, the count does that, unless your count val is a lazy val.
> The count method is the one that retrieves the information of the rdd, it
> has do go through all of it's data do determine how many records the RDD
> has.
>
> Regards,
>
> 2016-09-15 22:23 GMT-03:00 chen yong :
>
>>
>> Dear Dirceu,
>>
>> Thanks for your kind help.
>> i cannot see any code line corresponding to ". retrieve the data from
>> your DataFrame/RDDs". which you suggested in the previous replies.
>>
>> Later, I guess
>>
>> the line
>>
>> val test = count
>>
>> is the key point. without it, it would not stop at the breakpont-1, right?
>>
>>
>>
>> --
>> *发件人:* Dirceu Semighini Filho 
>> *发送时间:* 2016年9月16日 0:39
>> *收件人:* chen yong
>> *抄送:* user@spark.apache.org
>> *主题:* Re: 答复: 答复: 答复: t it does not stop at breakpoints which is in an
>> anonymous function
>>
>> Hi Felix,
>> Are sure your n is greater than 0?
>> Here it stops first at breakpoint 1, image attached.
>> Have you got the count to see if it's also greater than 0?
>>
>> 2016-09-15 11:41 GMT-03:00 chen yong :
>>
>>> Dear Dirceu
>>>
>>>
>>> Thank you for your help.
>>>
>>>
>>> Acutally, I use Intellij IDEA to dubug the spark code.
>>>
>>>
>>> Let me use the following code snippet to illustrate my problem. In the
>>> code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
>>> when i debuged the code, it did not stop at breakpoint-1, it seems that
>>> the map
>>>
>>> function was skipped and it directly reached and stoped at the
>>> breakpoint-2.
>>>
>>> Additionally, I find the following two posts
>>> (1)http://stackoverflow.com/questions/29208844/apache-spark-
>>> logging-within-scala
>>> (2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html
>>>
>>> I am wondering whether loggin is an alternative approach to debugging
>>> spark anonymous functions.
>>>
>>>
>>> val count = spark.parallelize(1 to n, slices).map { i =>
>>>   val x = random * 2 - 1
>>>   val y = random * 2 - 1 (breakpoint-1 set in this line)
>>>   if (x*x + y*y < 1) 1 else 0
>>> }.reduce(_ + _)
>>> val test = x (breakpoint-2 set in this line)
>>>
>>>
>>>
>>> --
>>> *发件人:* Dirceu Semighini Filho 
>>> *发送时间:* 2016年9月14日 23:32
>>> *收件人:* chen yong
>>> *主题:* Re: 答复: 答复: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> I don't know which IDE do you use. I use Intellij, and here there is an
>>> Evaluate Expression dialog where I can execute code, whenever it has
>>> stopped in a breakpoint.
>>> In eclipse you have watch and inspect where you can do the same.
>>> Probably you are not seeing the debug stop in your functions because you
>>> never retrieve the data from your DataFrame/RDDs.
>>> What are you doing with this function? Are you getting the result of
>>> this RDD/Dataframe at some place?
>>> You can add a count after the function that you want to debug, just for
>>> debug, but don't forget to remove this after testing.
>>>
>>>
>>>
>>> 2016-09-14 12:20 GMT-03:00 chen yong :
>>>
 Dear Dirceu,


 thanks you again.


 Actually,I never saw it stopped at the breakpoints no matter how long I
 wait.  It just skipped the whole anonymous function to direactly reach
 the first breakpoint immediately after the anonymous function body. Is that
 normal? I suspect sth wrong in my debugging operations or settings. I am
 very new to spark and  scala.


 Additionally, please give me some detailed instructions about  "Some
 ides provide you a place where you can execute the code to see it's
 results". where is the PLACE


 your help badly needed!


 --
 *发件人:* Dirceu Semighini Filho 
 *发送时间:* 2016年9月14日 23:07
 *收件人:* chen yong
 *主题:* Re: 答复: t it does not stop at breakpoints which is in an

RE: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Dear Dirceu,


I am totally confused. In your reply you mentioned "...the count does that, 
...". However, in the code snippet shown in the attachment file 
FelixProblem.png of your previous mail, I cannot find any place where a 'count' ACTION is 
called. Would you please clearly show me which line it is that triggers the 
evaluation.

Thank you very much

From: Dirceu Semighini Filho 
Sent: September 16, 2016 21:07
To: chen yong
Cc: user@spark.apache.org
Subject: Re: it does not stop at breakpoints which is in an anonymous 
function

Hello Felix,
No, this line isn't the one that is triggering the execution of the function, 
the count does that, unless your count val is a lazy val.
The count method is the one that retrieves the information of the rdd, it has 
do go through all of it's data do determine how many records the RDD has.

Regards,

2016-09-15 22:23 GMT-03:00 chen yong 
>:


Dear Dirceu,

Thanks for your kind help.
i cannot see any code line corresponding to ". retrieve the data from your 
DataFrame/RDDs". which you suggested in the previous replies.

Later, I guess

the line

val test = count

is the key point. without it, it would not stop at the breakpont-1, right?




发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月16日 0:39
收件人: chen yong
抄送: user@spark.apache.org
主题: Re: 答复: 答复: 答复: t it does not stop at breakpoints which is in an anonymous 
function

Hi Felix,
Are sure your n is greater than 0?
Here it stops first at breakpoint 1, image attached.
Have you got the count to see if it's also greater than 0?

2016-09-15 11:41 GMT-03:00 chen yong 
>:

Dear Dirceu


Thank you for your help.


Acutally, I use Intellij IDEA to dubug the spark code.


Let me use the following code snippet to illustrate my problem. In the code 
lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2. when i 
debuged the code, it did not stop at breakpoint-1, it seems that the map

function was skipped and it directly reached and stoped at the breakpoint-2.

Additionally, I find the following two posts
(1)http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala
(2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html

I am wondering whether loggin is an alternative approach to debugging spark 
anonymous functions.


val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1 (breakpoint-1 set in this line)
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
val test = x (breakpoint-2 set in this line)




发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月14日 23:32
收件人: chen yong
主题: Re: 答复: 答复: t it does not stop at breakpoints which is in an anonymous 
function

I don't know which IDE do you use. I use Intellij, and here there is an 
Evaluate Expression dialog where I can execute code, whenever it has stopped in 
a breakpoint.
In eclipse you have watch and inspect where you can do the same.
Probably you are not seeing the debug stop in your functions because you never 
retrieve the data from your DataFrame/RDDs.
What are you doing with this function? Are you getting the result of this 
RDD/Dataframe at some place?
You can add a count after the function that you want to debug, just for debug, 
but don't forget to remove this after testing.



2016-09-14 12:20 GMT-03:00 chen yong 
>:

Dear Dirceu,


thanks you again.


Actually,I never saw it stopped at the breakpoints no matter how long I wait.  
It just skipped the whole anonymous function to direactly reach the first 
breakpoint immediately after the anonymous function body. Is that normal? I 
suspect sth wrong in my debugging operations or settings. I am very new to 
spark and  scala.


Additionally, please give me some detailed instructions about  "Some ides 
provide you a place where you can execute the code to see it's results". 
where is the PLACE


your help badly needed!



发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月14日 23:07
收件人: chen yong
主题: Re: 答复: t it does not stop at breakpoints which is in an anonymous function

You can call a count in the ide just to debug, or you can wait until it reaches 
the code, so you can debug.
Some ides provide you a place where you can execute the code to see it's 
results.
Be aware of not adding this operations in your production code, because they 
can slow down the execution of your code.



2016-09-14 11:43 GMT-03:00 chen yong 
>:


Thanks for your reply.

you mean i have to 

RE: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread chen yong
Also, I wonder what is the right way to debug a Spark program. If I use ten 
anonymous functions in one Spark program, then to debug each of them I have to 
place a COUNT action in advance and then remove it after debugging. Is that the 
right way?



From: Dirceu Semighini Filho 
Sent: September 16, 2016 21:07
To: chen yong
Cc: user@spark.apache.org
Subject: Re: it does not stop at breakpoints which is in an anonymous 
function

Hello Felix,
No, this line isn't the one that is triggering the execution of the function, 
the count does that, unless your count val is a lazy val.
The count method is the one that retrieves the information of the rdd, it has 
do go through all of it's data do determine how many records the RDD has.

Regards,

2016-09-15 22:23 GMT-03:00 chen yong 
>:


Dear Dirceu,

Thanks for your kind help.
i cannot see any code line corresponding to ". retrieve the data from your 
DataFrame/RDDs". which you suggested in the previous replies.

Later, I guess

the line

val test = count

is the key point. without it, it would not stop at the breakpont-1, right?




发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月16日 0:39
收件人: chen yong
抄送: user@spark.apache.org
主题: Re: 答复: 答复: 答复: t it does not stop at breakpoints which is in an anonymous 
function

Hi Felix,
Are sure your n is greater than 0?
Here it stops first at breakpoint 1, image attached.
Have you got the count to see if it's also greater than 0?

2016-09-15 11:41 GMT-03:00 chen yong 
>:

Dear Dirceu


Thank you for your help.


Acutally, I use Intellij IDEA to dubug the spark code.


Let me use the following code snippet to illustrate my problem. In the code 
lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2. when i 
debuged the code, it did not stop at breakpoint-1, it seems that the map

function was skipped and it directly reached and stoped at the breakpoint-2.

Additionally, I find the following two posts
(1)http://stackoverflow.com/questions/29208844/apache-spark-logging-within-scala
(2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html

I am wondering whether loggin is an alternative approach to debugging spark 
anonymous functions.


val count = spark.parallelize(1 to n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1 (breakpoint-1 set in this line)
  if (x*x + y*y < 1) 1 else 0
}.reduce(_ + _)
val test = x (breakpoint-2 set in this line)




发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月14日 23:32
收件人: chen yong
主题: Re: 答复: 答复: t it does not stop at breakpoints which is in an anonymous 
function

I don't know which IDE do you use. I use Intellij, and here there is an 
Evaluate Expression dialog where I can execute code, whenever it has stopped in 
a breakpoint.
In eclipse you have watch and inspect where you can do the same.
Probably you are not seeing the debug stop in your functions because you never 
retrieve the data from your DataFrame/RDDs.
What are you doing with this function? Are you getting the result of this 
RDD/Dataframe at some place?
You can add a count after the function that you want to debug, just for debug, 
but don't forget to remove this after testing.



2016-09-14 12:20 GMT-03:00 chen yong 
>:

Dear Dirceu,


thanks you again.


Actually,I never saw it stopped at the breakpoints no matter how long I wait.  
It just skipped the whole anonymous function to direactly reach the first 
breakpoint immediately after the anonymous function body. Is that normal? I 
suspect sth wrong in my debugging operations or settings. I am very new to 
spark and  scala.


Additionally, please give me some detailed instructions about  "Some ides 
provide you a place where you can execute the code to see it's results". 
where is the PLACE


your help badly needed!



发件人: Dirceu Semighini Filho 
>
发送时间: 2016年9月14日 23:07
收件人: chen yong
主题: Re: 答复: t it does not stop at breakpoints which is in an anonymous function

You can call a count in the ide just to debug, or you can wait until it reaches 
the code, so you can debug.
Some ides provide you a place where you can execute the code to see it's 
results.
Be aware of not adding this operations in your production code, because they 
can slow down the execution of your code.



2016-09-14 11:43 GMT-03:00 chen yong 
>:


Thanks for your reply.

you mean i have to insert some codes, such as x.count or x.collect, 
between the original spark code lines to 

Spark can't connect to secure phoenix

2016-09-16 Thread Ashish Gupta
Hi All,

I am running a Spark program on a secured cluster which creates a SqlContext for 
creating a dataframe over a Phoenix table.

When I run my program in local mode with the --master option set to local[2], my 
program works completely fine. However, when I try to run the same program with the 
master option set to yarn-client, I am getting the below exception:

Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed 
after attempts=5, exceptions:
Fri Sep 16 12:14:10 IST 2016, RpcRetryingCaller{globalStartTime=1474008247898, 
pause=100, retries=5}, org.apache.hadoop.hbase.MasterNotRunningException: 
com.google.protobuf.ServiceException: java.io.IOException: Could not set up IO 
Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:147)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:4083)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:528)
at 
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor(HBaseAdmin.java:550)
at 
org.apache.phoenix.query.ConnectionQueryServicesImpl.ensureTableCreated(ConnectionQueryServicesImpl.java:810)
... 50 more
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: 
com.google.protobuf.ServiceException: java.io.IOException: Could not set up IO 
Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1540)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.makeStub(ConnectionManager.java:1560)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getKeepAliveMasterService(ConnectionManager.java:1711)
at 
org.apache.hadoop.hbase.client.MasterCallable.prepare(MasterCallable.java:38)
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:124)
... 54 more
Caused by: com.google.protobuf.ServiceException: java.io.IOException: Could 
not set up IO Streams to demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at 
org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$BlockingStub.isMasterRunning(MasterProtos.java:58152)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$MasterServiceStubMaker.isMasterRunning(ConnectionManager.java:1571)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStubNoRetries(ConnectionManager.java:1509)
at 
org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation$StubMaker.makeStub(ConnectionManager.java:1531)
... 58 more
Caused by: java.io.IOException: Could not set up IO Streams to 
demo-qa2-nn/10.60.2.15:16000
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:779)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:887)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:856)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1200)
at 
org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:213)
... 63 more
Caused by: java.lang.RuntimeException: SASL authentication failed. The most 
likely cause is missing or invalid credentials. Consider 'kinit'.
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$1.run(RpcClientImpl.java:679)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1709)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:637)
at 
org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:745)
... 67 more
Caused by: javax.security.sasl.SaslException: GSS initiate failed [Caused 
by GSSException: No valid credentials provided (Mechanism level: Failed to find 
any Kerberos tgt)]
at 
com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
at 
org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
at 

Re: Missing output partition file in S3

2016-09-16 Thread Tracy Li


Sent from my iPhone

> On Sep 15, 2016, at 1:37 PM, Chen, Kevin  wrote:
> 
> Hi,
> 
> Has any one encountered an issue of missing output partition file in S3 ? My 
> spark job writes output to a S3 location. Occasionally, I noticed one 
> partition file is missing. As a result, one chunk of data was lost. If I 
> rerun the same job, the problem usually goes away. This has been happening 
> pretty random. I observed once or twice a week on a daily run job. I am using 
> Spark 1.2.1.
> 
> Very much appreciated on any input, suggestion of fix/workaround.
> 
> 
> 


Hive api vs Dataset api

2016-09-16 Thread igor.berman
Hi,
I wanted to understand if there is any other advantage, besides API syntax,
when using the Hive/table API vs. the Dataset API in Spark SQL (v2.0)?
Any additional optimizations, maybe?
I'm most interested in partitioned Parquet tables stored on S3. Is there any
difference if I'm already comfortable with the Dataset API?

In general, our use case is to stream data into S3, partitioned by some
business keys (3 levels of nesting).
In addition, does the Hive API somehow help with the "small files" problem? (I'm aware
of coalesce.)
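
For illustration, roughly what this looks like with the DataFrame/Dataset API in 2.0 (the bucket path and column names below are made up, and `df`/`spark` are assumed to exist):

// write partitioned Parquet to S3; coalesce first to get fewer, larger files per partition
df.coalesce(8)
  .write
  .partitionBy("country", "year", "month")  // 3 levels of nesting
  .mode("append")
  .parquet("s3a://my-bucket/events")

// read back either directly by path ...
val byPath = spark.read.parquet("s3a://my-bucket/events")
// ... or through the catalog, if the location is registered as a (Hive) table
val byTable = spark.table("events")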


Thanks in advance



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Hive-api-vs-Dataset-api-tp27741.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: it does not stop at breakpoints which is in an anonymous function

2016-09-16 Thread Dirceu Semighini Filho
Hello Felix,
No, this line isn't the one that is triggering the execution of the
function; the count does that, unless your count val is a lazy val.
The count method is the one that retrieves the information of the RDD; it
has to go through all of its data to determine how many records the RDD
has.

Regards,

2016-09-15 22:23 GMT-03:00 chen yong :

>
> Dear Dirceu,
>
> Thanks for your kind help.
> i cannot see any code line corresponding to ". retrieve the data from
> your DataFrame/RDDs". which you suggested in the previous replies.
>
> Later, I guess
>
> the line
>
> val test = count
>
> is the key point. without it, it would not stop at the breakpont-1, right?
>
>
>
> --
> *发件人:* Dirceu Semighini Filho 
> *发送时间:* 2016年9月16日 0:39
> *收件人:* chen yong
> *抄送:* user@spark.apache.org
> *主题:* Re: 答复: 答复: 答复: t it does not stop at breakpoints which is in an
> anonymous function
>
> Hi Felix,
> Are sure your n is greater than 0?
> Here it stops first at breakpoint 1, image attached.
> Have you got the count to see if it's also greater than 0?
>
> 2016-09-15 11:41 GMT-03:00 chen yong :
>
>> Dear Dirceu
>>
>>
>> Thank you for your help.
>>
>>
>> Acutally, I use Intellij IDEA to dubug the spark code.
>>
>>
>> Let me use the following code snippet to illustrate my problem. In the
>> code lines below, I've set two breakpoints, breakpoint-1 and breakpoint-2.
>> when i debuged the code, it did not stop at breakpoint-1, it seems that
>> the map
>>
>> function was skipped and it directly reached and stoped at the
>> breakpoint-2.
>>
>> Additionally, I find the following two posts
>> (1)http://stackoverflow.com/questions/29208844/apache-spark-
>> logging-within-scala
>> (2)https://www.mail-archive.com/user@spark.apache.org/msg29010.html
>>
>> I am wondering whether loggin is an alternative approach to debugging
>> spark anonymous functions.
>>
>>
>> val count = spark.parallelize(1 to n, slices).map { i =>
>>   val x = random * 2 - 1
>>   val y = random * 2 - 1 (breakpoint-1 set in this line)
>>   if (x*x + y*y < 1) 1 else 0
>> }.reduce(_ + _)
>> val test = x (breakpoint-2 set in this line)
>>
>>
>>
>> --
>> *发件人:* Dirceu Semighini Filho 
>> *发送时间:* 2016年9月14日 23:32
>> *收件人:* chen yong
>> *主题:* Re: 答复: 答复: t it does not stop at breakpoints which is in an
>> anonymous function
>>
>> I don't know which IDE do you use. I use Intellij, and here there is an
>> Evaluate Expression dialog where I can execute code, whenever it has
>> stopped in a breakpoint.
>> In eclipse you have watch and inspect where you can do the same.
>> Probably you are not seeing the debug stop in your functions because you
>> never retrieve the data from your DataFrame/RDDs.
>> What are you doing with this function? Are you getting the result of this
>> RDD/Dataframe at some place?
>> You can add a count after the function that you want to debug, just for
>> debug, but don't forget to remove this after testing.
>>
>>
>>
>> 2016-09-14 12:20 GMT-03:00 chen yong :
>>
>>> Dear Dirceu,
>>>
>>>
>>> thanks you again.
>>>
>>>
>>> Actually,I never saw it stopped at the breakpoints no matter how long I
>>> wait.  It just skipped the whole anonymous function to direactly reach
>>> the first breakpoint immediately after the anonymous function body. Is that
>>> normal? I suspect sth wrong in my debugging operations or settings. I am
>>> very new to spark and  scala.
>>>
>>>
>>> Additionally, please give me some detailed instructions about  "Some
>>> ides provide you a place where you can execute the code to see it's
>>> results". where is the PLACE
>>>
>>>
>>> your help badly needed!
>>>
>>>
>>> --
>>> *发件人:* Dirceu Semighini Filho 
>>> *发送时间:* 2016年9月14日 23:07
>>> *收件人:* chen yong
>>> *主题:* Re: 答复: t it does not stop at breakpoints which is in an
>>> anonymous function
>>>
>>> You can call a count in the ide just to debug, or you can wait until it
>>> reaches the code, so you can debug.
>>> Some ides provide you a place where you can execute the code to see it's
>>> results.
>>> Be aware of not adding this operations in your production code, because
>>> they can slow down the execution of your code.
>>>
>>>
>>>
>>> 2016-09-14 11:43 GMT-03:00 chen yong :
>>>

 Thanks for your reply.

 you mean i have to insert some codes, such as x.count or
 x.collect, between the original spark code lines to invoke some
 operations, right?
 but, where is the right places to put my code lines?

 Felix

 --
 *发件人:* Dirceu Semighini Filho 
 *发送时间:* 2016年9月14日 22:33
 *收件人:* chen yong
 *抄送:* user@spark.apache.org
 *主题:* Re: t it does not stop at breakpoints which is in an anonymous
 

Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Sean Owen
Oh this is the netflix dataset right? I recognize it from the number
of users/items. It's not fast on a laptop or anything, and takes
plenty of memory, but succeeds. I haven't run this recently but it
worked in Spark 1.x.
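
For reference, a rough sketch of the checkpoint-interval tuning mentioned further
down in this thread (the checkpoint directory path is just a placeholder, and
ratings: RDD[Rating] is assumed to be prepared elsewhere):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

sc.setCheckpointDir("hdfs:///tmp/als-checkpoints")  // placeholder path
val als = new ALS()
  .setRank(32)
  .setIterations(50)
  .setCheckpointInterval(5)  // truncate the lineage every 5 iterations
val model = als.run(ratings)  // ratings: RDD[Rating]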

On Fri, Sep 16, 2016 at 5:13 PM, Roshani Nagmote
 wrote:
> I am also surprised that I face this problems with fairy small dataset on 14
> M4.2xlarge machines.  Could you please let me know on which dataset you can
> run 100 iterations of rank 30 on your laptop?
>
> I am currently just trying to run the default example code given with spark
> to run ALS on movie lens dataset. I did not change anything in the code.
> However I am running this example on Netflix dataset (1.5 gb)
>
> Thanks,
> Roshani
>
>
> On Friday, September 16, 2016, Sean Owen  wrote:
>>
>> You may have to decrease the checkpoint interval to say 5 if you're
>> getting StackOverflowError. You may have a particularly deep lineage
>> being created during iterations.
>>
>> No space left on device means you don't have enough local disk to
>> accommodate the big shuffles in some stage. You can add more disk or
>> maybe look at tuning shuffle params to do more in memory and maybe
>> avoid spilling to disk as much.
>>
>> However, given the small data size, I'm surprised that you see either
>> problem.
>>
>> 10-20 iterations is usually where the model stops improving much anyway.
>>
>> I can run 100 iterations of rank 30 on my *laptop* so something is
>> fairly wrong in your setup or maybe in other parts of your user code.
>>
>> On Thu, Sep 15, 2016 at 10:00 PM, Roshani Nagmote
>>  wrote:
>> > Hi,
>> >
>> > I need help to run matrix factorization ALS algorithm in Spark MLlib.
>> >
>> > I am using dataset(1.5Gb) having 480189 users and 17770 items formatted
>> > in
>> > similar way as Movielens dataset.
>> > I am trying to run MovieLensALS example jar on this dataset on AWS Spark
>> > EMR
>> > cluster having 14 M4.2xlarge slaves.
>> >
>> > Command run:
>> > /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn
>> > --class
>> > org.apache.spark.examples.mllib.MovieLensALS --jars
>> > /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
>> > /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32
>> > --numIterations 50 --kryo s3://dataset/input_dataset
>> >
>> > Issues I get:
>> > If I increase rank to 70 or more and numIterations 15 or more, I get
>> > following errors:
>> > 1) stack overflow error
>> > 2) No space left on device - shuffle phase
>> >
>> > Could you please let me know if there are any parameters I should tune
>> > to
>> > make this algorithm work on this dataset?
>> >
>> > For better rmse, I want to increase iterations. Am I missing something
>> > very
>> > trivial? Could anyone help me run this algorithm on this specific
>> > dataset
>> > with more iterations?
>> >
>> > Was anyone able to run ALS on spark with more than 100 iterations and
>> > rank
>> > more than 30?
>> >
>> > Any help will be greatly appreciated.
>> >
>> > Thanks and Regards,
>> > Roshani

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
I am also surprised that I face these problems with a fairly small dataset on
14 M4.2xlarge machines. Could you please let me know on which dataset you
can run 100 iterations of rank 30 on your laptop?

I am currently just trying to run the default example code given with Spark
to run ALS on the MovieLens dataset. I did not change anything in the
code. However, I am running this example on the Netflix dataset (1.5 GB).

Thanks,
Roshani

On Friday, September 16, 2016, Sean Owen  wrote:

> You may have to decrease the checkpoint interval to say 5 if you're
> getting StackOverflowError. You may have a particularly deep lineage
> being created during iterations.
>
> No space left on device means you don't have enough local disk to
> accommodate the big shuffles in some stage. You can add more disk or
> maybe look at tuning shuffle params to do more in memory and maybe
> avoid spilling to disk as much.
>
> However, given the small data size, I'm surprised that you see either
> problem.
>
> 10-20 iterations is usually where the model stops improving much anyway.
>
> I can run 100 iterations of rank 30 on my *laptop* so something is
> fairly wrong in your setup or maybe in other parts of your user code.
>
> On Thu, Sep 15, 2016 at 10:00 PM, Roshani Nagmote
> > wrote:
> > Hi,
> >
> > I need help to run matrix factorization ALS algorithm in Spark MLlib.
> >
> > I am using dataset(1.5Gb) having 480189 users and 17770 items formatted
> in
> > similar way as Movielens dataset.
> > I am trying to run MovieLensALS example jar on this dataset on AWS Spark
> EMR
> > cluster having 14 M4.2xlarge slaves.
> >
> > Command run:
> > /usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn
> --class
> > org.apache.spark.examples.mllib.MovieLensALS --jars
> > /usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar
> > /usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32
> > --numIterations 50 --kryo s3://dataset/input_dataset
> >
> > Issues I get:
> > If I increase rank to 70 or more and numIterations 15 or more, I get
> > following errors:
> > 1) stack overflow error
> > 2) No space left on device - shuffle phase
> >
> > Could you please let me know if there are any parameters I should tune to
> > make this algorithm work on this dataset?
> >
> > For better rmse, I want to increase iterations. Am I missing something
> very
> > trivial? Could anyone help me run this algorithm on this specific dataset
> > with more iterations?
> >
> > Was anyone able to run ALS on spark with more than 100 iterations and
> rank
> > more than 30?
> >
> > Any help will be greatly appreciated.
> >
> > Thanks and Regards,
> > Roshani
>


Re: App works, but executor state is "killed"

2016-09-16 Thread satishl
Any solutions for this?
Spark version: 1.4.1, running in standalone mode. 
All my applications complete successfully, but the Spark master UI shows
the executors in KILLED status.
Is it just a UI bug, or are my executors actually KILLED?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/App-works-but-executor-state-is-killed-tp1028p27743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Issues while running MLlib matrix factorization ALS algorithm

2016-09-16 Thread Roshani Nagmote
Hello,

Thanks for your reply.

Yes, it's the Netflix dataset. And when I get "no space left on device", my '/mnt' 
directory gets filled up. I checked. 

/usr/lib/spark/bin/spark-submit --deploy-mode cluster --master yarn --class 
org.apache.spark.examples.mllib.MovieLensALS --jars 
/usr/lib/spark/examples/jars/scopt_2.11-3.3.0.jar 
/usr/lib/spark/examples/jars/spark-examples_2.11-2.0.0.jar --rank 32 
--numIterations 100 --kryo s3://dataset_netflix

When I run above command, I get following error

Job aborted due to stage failure: Task 221 in stage 53.0 failed 4 times, most 
recent failure: Lost task 221.3 in stage 53.0 (TID 9817, ): 
java.io.FileNotFoundException: 
/mnt/yarn/usercache/hadoop/appcache/application_1473786456609_0042/blockmgr-045c2dec-7765-4954-9c9a-c7452f7bd3b7/08/shuffle_168_221_0.data.b17d39a6-4d3c-4198-9e25-e19ca2b4d368
 (No space left on device)

I think I should not need to increase the disk space, as the data is not that 
big. So, is there any way I can set up parameters so that it does not use much 
disk space? I don't know much about tuning parameters. 

It will be great if anyone can help me with this.
 
Thanks,
Roshani

> On Sep 16, 2016, at 9:18 AM, Sean Owen  wrote:
> 
>  



JDBC Very Slow

2016-09-16 Thread Benjamin Kim
Has anyone using Spark 1.6.2 encountered very slow responses from pulling data 
from PostgreSQL using JDBC? I can get to the table and see the schema, but when 
I do a show, it takes very long or keeps timing out.

The code is simple.

val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> 
"jdbc:postgresql://dbserver:port/database?user=user&password=password",
   "dbtable" -> "schema.table")).load()

jdbcDF.show

If anyone can help, please let me know.

Thanks,
Ben