[Package Release] Widely accepted XGBoost now available in Spark

2016-03-16 Thread Nan Zhu
Dear Spark Users and Developers, 

(We apologize if you receive multiple copies of this email; we are resending
because we found that our email was not delivered to the user mailing list correctly.)

We are happy to announce the release of XGBoost4J
(http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html),
a portable distributed XGBoost in Spark, Flink and Dataflow.

XGBoost is an optimized distributed gradient boosting library designed to be highly
efficient, flexible and portable. XGBoost provides parallel tree boosting
(also known as GBDT, GBM) that solves many data science problems in a fast and
accurate way. It has been the winning solution for many machine learning
scenarios, ranging from machine learning challenges to industrial use cases.

XGBoost4J is a new package in XGBoost aiming to provide clean Scala/Java
APIs and seamless integration with mainstream data processing platforms,
like Apache Spark. With XGBoost4J, users can run XGBoost as a stage of a Spark
job and build a unified pipeline from ETL to model training to data product
service within Spark, instead of jumping across two different systems, i.e.
XGBoost and Spark.

Today, we release the first version of XGBoost4J to bring more choices to
Spark users who are seeking solutions to build highly efficient data
analytics platforms and to enrich the Spark ecosystem. We will keep moving forward
to integrate with more features of Spark. Of course, you are more than welcome
to join us and contribute to the project!

For more details of distributed XGBoost, you can refer to the recently published paper:
http://arxiv.org/abs/1603.02754

Best,

-- 
Nan Zhu
http://codingcat.me





[Streaming] textFileStream has no events shown in web UI

2016-03-16 Thread Hao Ren
Just a quick question,

When using textFileStream, I do not see any events in the web UI.
Actually, I am uploading files to S3 every 5 seconds,
and the mini-batch duration is 30 seconds.
On the web UI:

 *Input Rate*
Avg: 0.00 events/sec

But the scheduling delay and processing time are correct, and the output of
the stream is also correct. Not sure why the web UI has not detected any events.
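
For reference, a minimal sketch of the setup being described, assuming a
30-second batch interval; the bucket and path below are placeholders:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// 30-second batches, monitoring an S3 prefix for new files
val ssc = new StreamingContext(sc, Seconds(30))
val lines = ssc.textFileStream("s3n://my-bucket/incoming/")

// Counting per batch confirms data is arriving, even if the UI's
// Input Rate shows 0 events/sec for file-based streams
lines.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))

ssc.start()
ssc.awaitTermination()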

Thank you.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Re: The built-in indexes in ORC file do not work.

2016-03-16 Thread Mich Talebzadeh
Hi,

The parameters that control the stripe and row group sizes are configurable via the
ORC table creation script:

CREATE TABLE dummy (
 ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="1" )
;
So here I make my stripe quite small, 16MB (as opposed to the default of
64MB), and set orc.row.index.stride = 1.

You can find out the available stats at row-group level for the various
columns (0,1,2,3,...) by doing something like the below:

hive --orcfiledump --rowindex 0,1,2,3,4,5,6
/user/hive/warehouse/test.db/dummy/00_0

In reality I have found that the only occasion the stats are used
is when you actually bucket the ORC table or use partitioning. There are
also dependencies on the block size and on how many rows are in each
block. If the whole table fits in one block, I believe the stats are
ignored (at least this was the case in older versions of Hive; I use Hive 2).

check the optimiser plan with

explain extended   
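
On the Spark SQL side, a minimal sketch of enabling ORC predicate push-down
and inspecting the plan (the predicate below is illustrative):

// Spark 1.5.x: enable ORC predicate push-down, then check whether the
// filter shows up in the physical plan
sqlContext.setConf("spark.sql.orc.filterPushdown", "true")

val df = sqlContext.sql("SELECT * FROM test.dummy WHERE id = 123")
df.explain(true)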

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 16 March 2016 at 10:23, Joseph  wrote:

> Hi all,
>
> I know that ORC provides three levels of indexes within each file:
> *file level, stripe level, and row level*.
>
> The file and stripe level statistics are in the file footer so that they are 
> easy to access to determine if the rest of the file needs to be read at all.
>
> Row level indexes include both column statistics for each row group and 
> position for seeking to the start of the row group.
>
> The following is my understanding:
>
> 1. The file and stripe level indexes are always generated; we cannot
> control them.
>
> 2. The row level indexes can be configured by "orc.create.index"(whether to 
> create row indexes) and "orc.row.index.stride"(number of rows between index 
> entries).
>
> 3. Each index has min/max statistics for each column, so sorting the data by the
> filter column should bring better performance.
>
> 4. To use any one of the three levels of indexes, we should enable predicate
> push-down by setting
> *spark.sql.orc.filterPushdown=true* (in sparkSQL) or
> *hive.optimize.ppd=true* (in hive).
>
> *But I found that the built-in indexes in ORC files
> did not work in either Spark 1.5.2 or Hive 1.2.1:*
> First, when the query's where clause didn't match any record (the
> filter column had a value beyond the range of the data), the performance with
> predicate push-down enabled was almost the same as with predicate push-down
> disabled. I think that, when the filter column has a value beyond the range of
> the data, none of the ORC files should be scanned if the file level indexes are
> used, so the performance should improve noticeably.
>
> Second, with "orc.create.index" enabled, the data sorted by the filter
> column, and a where clause that matches only a few records, the performance
> with predicate push-down enabled was almost the same as with predicate
> push-down disabled.
>
> Third, with predicate push-down and "orc.create.index" enabled, the
> performance when the filter column had a value beyond the range of the data
> was almost the same as when the filter column had a value covering almost
> the whole data.
>
> So, has anyone used ORC's built-in indexes before (especially in Spark
> SQL)?  What is my issue?
>
> Thanks!
>
> --
> Joseph
>


Re: spark.ml : eval model outside sparkContext

2016-03-16 Thread Peter Rudenko
Hi Emmanuel, I am looking for a similar solution. For now I have found only:
https://github.com/truecar/mleap


Thanks,
Peter Rudenko

On 3/16/16 12:47 AM, Emmanuel wrote:

Hello,

In MLLib with Spark 1.4, I was able to eval a model by loading it and 
using `predict` on a vector of features.

I would train on Spark but use the model in my own workflow.


In `spark.ml` it seems like the only way to eval is to use `transform`,
which only takes a DataFrame.
To build a DataFrame I need a SparkContext or SQLContext, so it
doesn't seem to be possible to eval outside of Spark.



*Is there either a way to build a DataFrame without a SparkContext, or
predict with a vector or list of features without a DataFrame?*

Thanks
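
For comparison, a minimal sketch of the old mllib-style flow referred to
above, where predict() takes a plain Vector once the model is loaded (the
model path here is hypothetical):

import org.apache.spark.mllib.classification.LogisticRegressionModel
import org.apache.spark.mllib.linalg.Vectors

// Loading still requires a SparkContext, but prediction itself does not:
// predict() takes a plain mllib Vector.
val model = LogisticRegressionModel.load(sc, "hdfs:///models/lr-model")  // hypothetical path
val score = model.predict(Vectors.dense(0.1, 2.3, 4.5))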




Re: Reg: Reading a csv file with String label into labeledpoint

2016-03-16 Thread Yanbo Liang
Actually it's unnecessary to convert the CSV rows to LabeledPoint, because we
use DataFrame as the standard data format when training a model with Spark ML.
What you should do is assemble the double attributes into a Vector column named
"features". Then you can train the ML model by specifying the featureCol and
labelCol.
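
A minimal sketch of that approach, with hypothetical column names ("f1",
"f2", "f3" for the doubles, "category" for the string label); the
StringIndexer turns the string label into the numeric label column the
classifier expects:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// df: the DataFrame loaded from the CSV
val assembler = new VectorAssembler()
  .setInputCols(Array("f1", "f2", "f3"))
  .setOutputCol("features")

val labelIndexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val lr = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")

val model = new Pipeline().setStages(Array(assembler, labelIndexer, lr)).fit(df)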

Thanks
Yanbo

2016-03-16 13:41 GMT+08:00 Dharmin Siddesh J :

> Hi
>
> I am trying to read a CSV with a few double attributes and a String label.
> How can I convert it to a LabeledPoint RDD so that I can run it with Spark
> MLlib classification algorithms?
>
> I have tried the LabeledPoint constructor (which seems usable only for
> regression), but it accepts only a double-format label. Is there any other
> way to handle the string label and convert the data into an RDD?
>
> Regards
> Siddesh
>
>
>


Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Saisai Shao
If you want to avoid failing existing jobs while restarting the NM, you could
enable work-preserving restart for the NM. In that case, restarting the NM will
not affect the running containers (the containers can keep running). That could
alleviate the NM restart problem.
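
A sketch of the relevant yarn-site.xml settings, assuming the standard YARN
NodeManager recovery and Spark shuffle-service properties (the recovery dir
and port below are placeholders; check your Hadoop/CDH version's docs for
exact support). Work-preserving restart also requires a fixed NM port:

<!-- Work-preserving NodeManager restart -->
<property>
  <name>yarn.nodemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.nodemanager.recovery.dir</name>
  <value>/var/lib/hadoop-yarn/nm-recovery</value> <!-- hypothetical local dir -->
</property>
<property>
  <name>yarn.nodemanager.address</name>
  <value>${yarn.nodemanager.hostname}:45454</value> <!-- fixed port, not ephemeral -->
</property>

<!-- Spark external shuffle service as an NM auxiliary service -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>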

Thanks
Saisai

On Wed, Mar 16, 2016 at 6:30 PM, Alex Dzhagriev  wrote:

> Hi Vinay,
>
> I believe it's not possible, as the spark-shuffle code has to run in the
> same JVM process as the Node Manager. I haven't heard anything about
> on-the-fly bytecode loading in the Node Manager.
>
> Thanks, Alex.
>
> On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap 
> wrote:
>
>> Hi all,
>>
>> I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*
>>
>> As per the documentation to enable Dynamic Allocation of Executors in
>> Spark,
>> it is required to add the shuffle service jar to YARN Node Manager's
>> classpath and restart the YARN Node Manager.
>>
>> Is there any way to dynamically supply the shuffle service jar
>> information from the application itself and avoid disturbing the running
>> YARN service?
>>
>> Tried a couple of options by uploading the jar to HDFS and setting
>> *yarn.application.classpath*, but it did not work. On container launch for
>> the executor it fails to recognize the shuffle service.
>>
>> Any help would be greatly appreciated.
>>
>> --
>> *Thanks and regards*
>> *Vinay Kashyap*
>>
>
>


Re: Spark Thriftserver

2016-03-16 Thread ayan guha
Thank you Jeff. However, I am more looking for fine-grained access control,
for example something like Ranger. Do you know if the Spark thriftserver is
supported by Ranger or Sentry? Or something similar? Much appreciated.

On Wed, Mar 16, 2016 at 1:49 PM, Jeff Zhang  wrote:

> It's same as hive thrift server. I believe kerberos is supported.
>
> On Wed, Mar 16, 2016 at 10:48 AM, ayan guha  wrote:
>
>> so, how about implementing security? Any pointer will be helpful
>>
>> On Wed, Mar 16, 2016 at 1:44 PM, Jeff Zhang  wrote:
>>
>>> The spark thrift server allows you to run Hive queries on the Spark engine. It
>>> can be used as a JDBC server.
>>>
>>> On Wed, Mar 16, 2016 at 10:42 AM, ayan guha  wrote:
>>>
 Sorry to be a dumb-head today, but what is the purpose of the spark
 thriftserver then? In other words, should I view the spark thriftserver as a
 better version of the Hive one (with Spark as the execution engine instead of
 MR/Tez), OR should I see it as a JDBC server?

 On Wed, Mar 16, 2016 at 11:44 AM, Jeff Zhang  wrote:

> The spark thrift server is very similar to the hive thrift server. You can
> use the hive jdbc driver to access the spark thrift server. AFAIK, all the
> features of the hive thrift server are also available in the spark thrift
> server.
>
> On Wed, Mar 16, 2016 at 8:39 AM, ayan guha 
> wrote:
>
>> Hi All
>>
>> My understanding of the thriftserver is that it is used to expose pre-loaded
>> RDDs/dataframes to tools that can connect through JDBC.
>>
>> Is there something like Spark JDBC server too? Does it do the same
>> thing? What is the difference between these two?
>>
>> How does Spark JDBC/Thrift support security? Can we restrict certain
>> users to accessing certain dataframes and not the others?
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



 --
 Best Regards,
 Ayan Guha

>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards,
Ayan Guha
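
For reference, the usual way the Spark thrift server is started and reached
over JDBC (the master, host and port below are placeholders):

# Start the thrift server that ships with the Spark distribution
./sbin/start-thriftserver.sh --master yarn-client

# Connect with any Hive JDBC client, e.g. the bundled beeline
./bin/beeline -u jdbc:hive2://localhost:10000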


Re: Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Alex Dzhagriev
Hi Vinay,

I believe it's not possible, as the spark-shuffle code has to run in the
same JVM process as the Node Manager. I haven't heard anything about
on-the-fly bytecode loading in the Node Manager.

Thanks, Alex.

On Wed, Mar 16, 2016 at 10:12 AM, Vinay Kashyap  wrote:

> Hi all,
>
> I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*
>
> As per the documentation to enable Dynamic Allocation of Executors in
> Spark,
> it is required to add the shuffle service jar to YARN Node Manager's
> classpath and restart the YARN Node Manager.
>
> Is there any way to dynamically supply the shuffle service jar
> information from the application itself and avoid disturbing the running
> YARN service?
>
> Tried a couple of options by uploading the jar to HDFS and setting
> *yarn.application.classpath*, but it did not work. On container launch for
> the executor it fails to recognize the shuffle service.
>
> Any help would be greatly appreciated.
>
> --
> *Thanks and regards*
> *Vinay Kashyap*
>


The built-in indexes in ORC file do not work.

2016-03-16 Thread Joseph
Hi all,

I know that ORC provides three levels of indexes within each file: file
level, stripe level, and row level.
The file and stripe level statistics are in the file footer so that they are 
easy to access to determine if the rest of the file needs to be read at all. 
Row level indexes include both column statistics for each row group and 
position for seeking to the start of the row group. 

The following is my understanding:
1. The file and stripe level indexes are always generated; we cannot control
them.
2. The row level indexes can be configured by "orc.create.index" (whether to
create row indexes) and "orc.row.index.stride" (number of rows between index
entries).
3. Each index has min/max statistics for each column, so sorting the data by the
filter column should bring better performance.
4. To use any one of the three levels of indexes, we should enable predicate
push-down by setting spark.sql.orc.filterPushdown=true (in sparkSQL) or
hive.optimize.ppd=true (in hive).

But I found that the built-in indexes in ORC files did not work in either Spark
1.5.2 or Hive 1.2.1:
First, when the query's where clause didn't match any record (the filter column
had a value beyond the range of the data), the performance with predicate
push-down enabled was almost the same as with predicate push-down disabled. I
think that, when the filter column has a value beyond the range of the data,
none of the ORC files should be scanned if the file level indexes are used, so
the performance should improve noticeably.

Second, with "orc.create.index" enabled, the data sorted by the filter column,
and a where clause that matches only a few records, the performance with
predicate push-down enabled was almost the same as with predicate push-down
disabled.

Third, with predicate push-down and "orc.create.index" enabled, the performance
when the filter column had a value beyond the range of the data was almost the
same as when the filter column had a value covering almost the whole data.

So, has anyone used ORC's built-in indexes before (especially in Spark SQL)?
What is my issue?

Thanks!



Joseph


Enabling spark_shuffle service without restarting YARN Node Manager

2016-03-16 Thread Vinay Kashyap
Hi all,

I am using *Spark 1.5.1* in *yarn-client* mode along with *CDH 5.5*

As per the documentation to enable Dynamic Allocation of Executors in Spark,
it is required to add the shuffle service jar to YARN Node Manager's
classpath and restart the YARN Node Manager.

Is there any way to dynamically supply the shuffle service jar
information from the application itself and avoid disturbing the running
YARN service?

Tried a couple of options by uploading the jar to HDFS and setting
*yarn.application.classpath*, but it did not work. On container launch for the
executor it fails to recognize the shuffle service.

Any help would be greatly appreciated.

-- 
*Thanks and regards*
*Vinay Kashyap*


Re: [Streaming] Difference between windowed stream and stream with large batch size?

2016-03-16 Thread Hao Ren
Any ideas ?

Feel free to ask me more details, if my questions are not clear.

Thank you.

On Mon, Mar 7, 2016 at 3:38 PM, Hao Ren  wrote:

> I want to understand the advantage of using a windowed stream.
>
> For example,
>
> Stream 1:
> initial batch duration = 5 s,
> then transformed into a windowed stream with (windowLength = 30s,
> slideInterval = 30s)
>
> Stream 2:
> batch duration = 30 s
>
> Questions:
>
> 1. Is Stream 1 equivalent to Stream 2 in behavior? Do users observe the
> same result?
> 2. If yes, what is the advantage of one vs. the other? Performance or
> something else?
> 3. Is a stream with a large batch interval reasonable, say 30 mins or even an hour?
>
> Thank you.
>
> --
> Hao Ren
>
> Data Engineer @ leboncoin
>
> Paris, France
>
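
For concreteness, a minimal sketch of the two setups being compared (the
socket source is a placeholder, and conf is an assumed SparkConf; in practice
you would run one or the other, not both in one application):

import org.apache.spark.streaming.{Seconds, StreamingContext}

// Stream 1: 5-second batches, then a 30s/30s tumbling window
val ssc1 = new StreamingContext(conf, Seconds(5))
val windowed = ssc1.socketTextStream("host", 9999)   // placeholder source
  .window(Seconds(30), Seconds(30))

// Stream 2: 30-second batches, no window
val ssc2 = new StreamingContext(conf, Seconds(30))
val plain = ssc2.socketTextStream("host", 9999)      // placeholder source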



-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France


Get Pair of Topic and Message from Kafka + Spark Streaming

2016-03-16 Thread Imre Nagi
Hi,

I'm just trying to process the data that comes from the Kafka source in my
Spark Streaming application. What I want to do is get the pair of topic and
message as a tuple from the message stream.

Here is my stream:

val streams = KafkaUtils.createDirectStream[String, Array[Byte],
  StringDecoder, DefaultDecoder](ssc, kafkaParameter,
  Set("topic1", "topic2"))


I have tried several things, but still failed when I did some
transformations from the stream to the pair of topic and message. I hope
somebody can help me here.
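
A minimal sketch of one common way to do this with the direct stream,
assuming the ssc and kafkaParameter from the snippet above; it relies on the
fact that, for the direct stream, RDD partition i corresponds to
offsetRanges(i):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

// Each element of the direct stream is a (key, value) pair; recover the topic
// per partition from the offset ranges and emit (topic, message) tuples.
val topicAndMessage = streams.transform { rdd =>
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.mapPartitionsWithIndex { (i, iter) =>
    val topic = ranges(i).topic
    iter.map { case (_, message) => (topic, message) }
  }
}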

Thanks,
Imre


Re: newbie HDFS S3 best practices

2016-03-16 Thread Chris Miller
If you have lots of small files, distcp should handle that well -- it's
supposed to distribute the transfer of files across the nodes in your
cluster. Conductor looks interesting if you're trying to distribute the
transfer of single, large file(s)...

right?
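
For the HDFS-to-S3 copy itself, the usual tool is hadoop distcp (or s3-dist-cp
on EMR); the paths and bucket below are placeholders:

# Copy results from HDFS to S3 in parallel across the cluster
hadoop distcp hdfs:///user/myuser/results s3n://my-bucket/results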

--
Chris Miller

On Wed, Mar 16, 2016 at 4:43 AM, Andy Davidson <
a...@santacruzintegration.com> wrote:

> Hi Frank
>
> We have thousands of small files. Each file is between 6K and maybe 100K.
>
> Conductor looks interesting
>
> Andy
>
> From: Frank Austin Nothaft 
> Date: Tuesday, March 15, 2016 at 11:59 AM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: newbie HDFS S3 best practices
>
> Hard to say with #1 without knowing your application’s characteristics;
> for #2, we use conductor  with
> IAM roles, .boto/.aws/credentials files.
>
> Frank Austin Nothaft
> fnoth...@berkeley.edu
> fnoth...@eecs.berkeley.edu
> 202-340-0466
>
> On Mar 15, 2016, at 11:45 AM, Andy Davidson  > wrote:
>
> We use the spark-ec2 script to create AWS clusters as needed (we do not
> use AWS EMR)
>
>1. will we get better performance if we copy data to HDFS before we
>run instead of reading directly from S3?
>
>  2. What is a good way to move results from HDFS to S3?
>
>
> It seems like there are many ways to bulk copy to s3. Many of them require
> we explicitly use the AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
> . This seems like a bad
> idea?
>
> What would you recommend?
>
> Thanks
>
> Andy
>
>
>
>


Re: Does parallelize and collect preserve the original order of list?

2016-03-16 Thread Chris Miller
Short answer: Nope

Less short answer: Spark is not designed to guarantee ordering in this
case... it *may* come back in the same order, but there's no guarantee...
generally, you should not rely on the order being preserved unless you add
something to order by and then sort the result based on that.
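
A minimal sketch of that idea, assuming string elements and the output path
from the question: attach an explicit index before writing and sort on it
after reading back:

// Write: pair each element with its position so ordering is explicit on disk
val items: Seq[String] = Seq("a", "b", "c")            // placeholder data
sc.parallelize(items)
  .zipWithIndex()
  .map { case (value, idx) => s"$idx\t$value" }
  .saveAsTextFile(output)

// Read: parse the index back and sort on it before collecting
val items2 = sc.textFile(output)
  .map(_.split("\t", 2))
  .map { case Array(idx, value) => (idx.toLong, value) }
  .sortByKey()
  .values
  .collect()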

--
Chris Miller

On Wed, Mar 16, 2016 at 10:16 AM, JoneZhang  wrote:

> Step 1
> List<String> items = new ArrayList<String>();
> items.addAll(XXX);
> javaSparkContext.parallelize(items).saveAsTextFile(output);
> Step 2
> final List<String> items2 = ctx.textFile(output).collect();
>
> Do items and items2 have the same order?
>
>
> Best wishes.
> Thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Does-parallelize-and-collect-preserve-the-original-order-of-list-tp26512.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: reading file from S3

2016-03-16 Thread Chris Miller
+1 for Sab's thoughtful answer...

Yasemin: As Gourav said, using IAM roles is considered best practice and
generally will give you fewer headaches in the end... but you may have a
reason for doing it the way you are, and certainly the way you posted
should be supported and not cause the error you described.
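
If keys must be passed explicitly (rather than via IAM roles), a minimal
sketch of supplying them through the Hadoop configuration instead of the URI;
the bucket and file are from the original post, and reading the keys from
environment variables is an assumption:

// Supply s3n credentials through the Hadoop configuration instead of the URI
sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

val lines = sc.textFile("s3n://yasemindeneme/deneme.txt")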

--
Chris Miller

On Tue, Mar 15, 2016 at 11:22 PM, Sabarish Sasidharan <
sabarish.sasidha...@manthan.com> wrote:

> There are many solutions to a problem.
>
> Also understand that sometimes your situation might be such. For example, what
> if you are accessing S3 from your Spark job running in your continuous
> integration server sitting in your data center, or maybe a box under your
> desk. And sometimes you are just trying something.
>
> Also understand that sometimes you want answers to solve your problem at
> hand without redirecting you to something else. Understand what you
> suggested is an appropriate way of doing it, which I myself have proposed
> before, but that doesn't solve the OP's problem at hand.
>
> Regards
> Sab
> On 15-Mar-2016 8:27 pm, "Gourav Sengupta" 
> wrote:
>
>> Oh!!! What the hell
>>
>> Please never use the URI
>>
>> *s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY.* That is a major cause of
>> pain, security issues, code maintenance issues and of course something that
>> Amazon strongly suggests that we do not use. Please use roles and you will
>> not have to worry about security.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
>> sabarish@gmail.com> wrote:
>>
>>> You have a slash before the bucket name. It should be @.
>>>
>>> Regards
>>> Sab
>>> On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:
>>>
 Hi,

 I am using Spark 1.6.0 standalone and I want to read a txt file from S3
 bucket named yasemindeneme and my file name is deneme.txt. But I am getting
 this error. Here is the simple code
 
 Exception in thread "main" java.lang.IllegalArgumentException: Invalid
 hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
 /yasemindeneme/deneme.txt
 at
 org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
 at
 org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)


 I tried 2 options:
 *sc.hadoopConfiguration() *and
 *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*

 Also I did export AWS_ACCESS_KEY_ID= .
  export AWS_SECRET_ACCESS_KEY=
 But there is no change about error.

 Could you please help me about this issue?


 --
 hiç ender hiç

>>>
>>


Re: reading file from S3

2016-03-16 Thread Yasemin Kaya
Hi,

Thanks a lot all, I understand my problem comes from the *hadoop version*; I
moved to the Spark 1.6.0 *hadoop 2.4* build and there is no problem.

Best,
yasemin

2016-03-15 17:31 GMT+02:00 Gourav Sengupta :

> Once again, please use roles; there is no situation in which you have to
> specify the access keys in the URI. Please read the Amazon documentation and
> they will say the same. The only situation when you use the access keys in
> the URI is when you have not read the Amazon documentation :)
>
> Regards,
> Gourav
>
> On Tue, Mar 15, 2016 at 3:22 PM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
>> There are many solutions to a problem.
>>
>> Also understand that sometimes your situation might be such. For example, what
>> if you are accessing S3 from your Spark job running in your continuous
>> integration server sitting in your data center, or maybe a box under your
>> desk. And sometimes you are just trying something.
>>
>> Also understand that sometimes you want answers to solve your problem at
>> hand without redirecting you to something else. Understand what you
>> suggested is an appropriate way of doing it, which I myself have proposed
>> before, but that doesn't solve the OP's problem at hand.
>>
>> Regards
>> Sab
>> On 15-Mar-2016 8:27 pm, "Gourav Sengupta" 
>> wrote:
>>
>>> Oh!!! What the hell
>>>
>>> Please never use the URI
>>>
>>> *s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY.* That is a major cause
>>> of pain, security issues, code maintenance issues and of course something
>>> that Amazon strongly suggests that we do not use. Please use roles and you
>>> will not have to worry about security.
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>> On Tue, Mar 15, 2016 at 2:38 PM, Sabarish Sasidharan <
>>> sabarish@gmail.com> wrote:
>>>
 You have a slash before the bucket name. It should be @.

 Regards
 Sab
 On 15-Mar-2016 4:03 pm, "Yasemin Kaya"  wrote:

> Hi,
>
> I am using Spark 1.6.0 standalone and I want to read a txt file from
> S3 bucket named yasemindeneme and my file name is deneme.txt. But I am
> getting this error. Here is the simple code
> 
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
> hostname in URI s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@
> /yasemindeneme/deneme.txt
> at
> org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:45)
> at
> org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.initialize(Jets3tNativeFileSystemStore.java:55)
>
>
> I tried 2 options:
> *sc.hadoopConfiguration() *and
> *sc.textFile("s3n://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@/yasemindeneme/deneme.txt/");*
>
> Also I did export AWS_ACCESS_KEY_ID= .
>  export AWS_SECRET_ACCESS_KEY=
> But there is no change about error.
>
> Could you please help me about this issue?
>
>
> --
> hiç ender hiç
>

>>>
>


-- 
hiç ender hiç


Re: exception while running job as pyspark

2016-03-16 Thread Jeff Zhang
Please try export PYSPARK_PYTHON=
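
For example (the virtualenv path and script name below are hypothetical),
pointing both the driver and the executors at the virtualenv interpreter
before submitting:

# Point Spark's Python workers at the virtualenv interpreter
export PYSPARK_PYTHON=/path/to/venv/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/venv/bin/python

spark-submit my_job.py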

On Wed, Mar 16, 2016 at 3:00 PM, ram kumar  wrote:

> Hi,
>
> I get the following error when running a job as pyspark,
>
> {{{
>  An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0 (TID 6, ): java.io.IOException: Cannot run program "python2.7":
> error=2, No such file or directory
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
> at
> org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
> at
> org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
> at
> org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
> at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
> at
> org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
> at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: error=2, No such file or directory
> at java.lang.UNIXProcess.forkAndExec(Native Method)
> at java.lang.UNIXProcess.(UNIXProcess.java:248)
> at java.lang.ProcessImpl.start(ProcessImpl.java:134)
> at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
> ... 36 more
>
> }}}
>
> python2.7 couldn't be found. But I am using virtualenv Python 2.7
> {{{
> [ram@test-work workspace]$ python
> Python 2.7.8 (default, Mar 15 2016, 04:37:00)
> [GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>>
> }}}
>
> Can anyone help me with this?
> Thanks
>



-- 
Best Regards

Jeff Zhang


exception while running job as pyspark

2016-03-16 Thread ram kumar
Hi,

I get the following error when running a job as pyspark,

{{{
 An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 6, ): java.io.IOException: Cannot run program "python2.7":
error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
at
org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
at
org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at
org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:101)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:300)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.(UNIXProcess.java:248)
at java.lang.ProcessImpl.start(ProcessImpl.java:134)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
... 36 more

}}}

python2.7 couldn't be found. But I am using virtualenv Python 2.7
{{{
[ram@test-work workspace]$ python
Python 2.7.8 (default, Mar 15 2016, 04:37:00)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
}}}

Can anyone help me with this?
Thanks


Unable to serialize dataframe

2016-03-16 Thread shyam
I am using Spark version 1.5.2 with Sparklingwater version 1.5.2 .

I am getting a runtime error when I try to write a dataframe to disk. My
code looks as shown below,

val df =
sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",inferSchema.toString).option("header",isHeader.toString).load(datapath)

df.write.parquet("some path")

I am getting a runtime error that says "Failed to load class for data
source: parquet."

I tried writing as JSON using the df.write.json function, and I got a similar
error for JSON as well.

I am new to Spark. Am I missing anything here?

Thanks,
Shyam



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Unable-to-serialize-dataframe-tp26516.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
