PySpark tests fail with java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.sources.FakeSourceOne not found

2023-04-12 Thread Ranga Reddy
Hi Team,

I am running the PySpark tests on Spark and they fail with *Provider
org.apache.spark.sql.sources.FakeSourceOne not found*.

Spark Version: 3.4.0/3.5.0
Python Version: 3.8.10
OS: Ubuntu 20.04


*Steps: *

# /opt/data/spark/build/sbt -Phive clean package
# /opt/data/spark/build/sbt test:compile
# pip3 install -r /opt/data/spark/dev/requirements.txt
# /opt/data/spark/python/run-tests --python-executables=python3

*Exception:*

==
ERROR [15.081s]: test_read_images
(pyspark.ml.tests.test_image.ImageFileFormatTest)
--
Traceback (most recent call last):
File "/opt/data/spark/python/pyspark/ml/tests/test_image.py", line 29, in
test_read_images
self.spark.read.format("image")
File "/opt/data/spark/python/pyspark/sql/readwriter.py", line 300, in load
return self._df(self._jreader.load(path))
File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py"
, line 1322, in __call__
return_value = get_return_value(
File "/opt/data/spark/python/pyspark/errors/exceptions/captured.py", line
176, in deco
return f(*a, **kw)
File "/opt/data/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py",
line 326, in get_return_value
raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o33.load.
: java.util.ServiceConfigurationError:
org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.sql.sources.FakeSourceOne not found
at java.util.ServiceLoader.fail(ServiceLoader.java:239)
at java.util.ServiceLoader.access$300(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:372)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper
.next(Wrappers.scala:46)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.DataSource$
.lookupDataSource(DataSource.scala:629)
at org.apache.spark.sql.execution.datasources.DataSource$
.lookupDataSourceV2(DataSource.scala:697)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:208)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:186)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at
py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.lang.Thread.run(Thread.java:750)


Could someone help me how to proceed further?
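
For reference, the lookup that fails here is plain java.util.ServiceLoader: Spark
resolves short format names such as "image" by loading every provider listed in
META-INF/services/org.apache.spark.sql.sources.DataSourceRegister files on the
class path, and FakeSourceOne is a test-only provider that, as far as I can tell,
is registered in sql/core's test resources. The error therefore usually means a
services file from the test build is visible while the listed class itself is not
(for example a stale or partial test:compile output). Below is a minimal
diagnostic sketch, not code from this thread, assuming Spark's jars and test
classes are on the class path:

import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister

// Walk the same ServiceLoader iteration that DataSource.lookupDataSource performs.
// If any services entry names a class missing from the class path (here
// FakeSourceOne), this loop fails with the same ServiceConfigurationError.
object ListDataSourceProviders {
  def main(args: Array[String]): Unit = {
    val loader = ServiceLoader.load(
      classOf[DataSourceRegister],
      Thread.currentThread().getContextClassLoader)
    val it = loader.iterator()
    while (it.hasNext) {
      println(it.next().shortName())
    }
  }
}

If the loop reproduces the error outside the Python tests, re-running the
test:compile step (or cleaning stale target directories) is usually a reasonable
first thing to try.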


-- 
Thanks and Regards


*Ranga Reddy*
*--*

*Bangalore, Karnataka, India*
*Mobile : +91-9986183183 | Email: rangareddy.av...@gmail.com*


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks Ted. Will do.

On Wed, Mar 18, 2015 at 2:27 PM, Ted Yu  wrote:

> Ranga:
> Please apply the patch from:
> https://github.com/apache/spark/pull/4867
>
> And rebuild Spark - the build would use Tachyon-0.6.1
>
> Cheers
>
> On Wed, Mar 18, 2015 at 2:23 PM, Ranga  wrote:
>
>> Hi Haoyuan
>>
>> No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
>> not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
>> Thanks for your help.
>>
>>
>> - Ranga
>>
>> On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li 
>> wrote:
>>
>>> Did you recompile it with Tachyon 0.6.0?
>>>
>>> Also, Tachyon 0.6.1 has been released this morning:
>>> http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases
>>>
>>> Best regards,
>>>
>>> Haoyuan
>>>
>>> On Wed, Mar 18, 2015 at 11:48 AM, Ranga  wrote:
>>>
>>>> I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
>>>> issue. Here are the logs:
>>>> 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
>>>> 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
>>>> 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts
>>>> to create tachyon dir in
>>>> /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/
>>>>
>>>> Thanks for any other pointers.
>>>>
>>>>
>>>> - Ranga
>>>>
>>>>
>>>>
>>>> On Wed, Mar 18, 2015 at 9:53 AM, Ranga  wrote:
>>>>
>>>>> Thanks for the information. Will rebuild with 0.6.0 till the patch is
>>>>> merged.
>>>>>
>>>>> On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu  wrote:
>>>>>
>>>>>> Ranga:
>>>>>> Take a look at https://github.com/apache/spark/pull/4867
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com wrote:
>>>>>>
>>>>>>> Hi, Ranga
>>>>>>>
>>>>>>> That's true. Typically a version mis-match issue. Note that spark
>>>>>>> 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to
>>>>>>> rebuild spark
>>>>>>> with your current tachyon release.
>>>>>>> We had used tachyon for several of our spark projects in a
>>>>>>> production environment.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sun.
>>>>>>>
>>>>>>> --
>>>>>>> fightf...@163.com
>>>>>>>
>>>>>>>
>>>>>>> *From:* Ranga 
>>>>>>> *Date:* 2015-03-18 06:45
>>>>>>> *To:* user@spark.apache.org
>>>>>>> *Subject:* StorageLevel: OFF_HEAP
>>>>>>> Hi
>>>>>>>
>>>>>>> I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
>>>>>>> cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
>>>>>>> However, when I try to persist the RDD, I get the following error:
>>>>>>>
>>>>>>> ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
>>>>>>> TachyonFS.java[connect]:364)  - Invalid method name:
>>>>>>> 'getUserUnderfsTempFolder'
>>>>>>> ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
>>>>>>> TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'
>>>>>>>
>>>>>>> Is this because of a version mis-match?
>>>>>>>
>>>>>>> On a different note, I was wondering if Tachyon has been used in a
>>>>>>> production environment by anybody in this group?
>>>>>>>
>>>>>>> Appreciate your help with this.
>>>>>>>
>>>>>>>
>>>>>>> - Ranga
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Haoyuan Li
>>> AMPLab, EECS, UC Berkeley
>>> http://www.cs.berkeley.edu/~haoyuan/
>>>
>>
>>
>


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Hi Haoyuan

No. I assumed that Spark-1.3.0 was already built with Tachyon-0.6.0. If
not, I can rebuild and try. Could you let me know how to rebuild with 0.6.0?
Thanks for your help.


- Ranga

On Wed, Mar 18, 2015 at 12:59 PM, Haoyuan Li  wrote:

> Did you recompile it with Tachyon 0.6.0?
>
> Also, Tachyon 0.6.1 has been released this morning:
> http://tachyon-project.org/ ; https://github.com/amplab/tachyon/releases
>
> Best regards,
>
> Haoyuan
>
> On Wed, Mar 18, 2015 at 11:48 AM, Ranga  wrote:
>
>> I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
>> issue. Here are the logs:
>> 15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
>> 15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
>> 15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts
>> to create tachyon dir in
>> /tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/
>>
>> Thanks for any other pointers.
>>
>>
>> - Ranga
>>
>>
>>
>> On Wed, Mar 18, 2015 at 9:53 AM, Ranga  wrote:
>>
>>> Thanks for the information. Will rebuild with 0.6.0 till the patch is
>>> merged.
>>>
>>> On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu  wrote:
>>>
>>>> Ranga:
>>>> Take a look at https://github.com/apache/spark/pull/4867
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com 
>>>> wrote:
>>>>
>>>>> Hi, Ranga
>>>>>
>>>>> That's true. Typically a version mis-match issue. Note that spark
>>>>> 1.2.1 has tachyon built in with version 0.5.0 , I think you may need to
>>>>> rebuild spark
>>>>> with your current tachyon release.
>>>>> We had used tachyon for several of our spark projects in a production
>>>>> environment.
>>>>>
>>>>> Thanks,
>>>>> Sun.
>>>>>
>>>>> --
>>>>> fightf...@163.com
>>>>>
>>>>>
>>>>> *From:* Ranga 
>>>>> *Date:* 2015-03-18 06:45
>>>>> *To:* user@spark.apache.org
>>>>> *Subject:* StorageLevel: OFF_HEAP
>>>>> Hi
>>>>>
>>>>> I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
>>>>> cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
>>>>> However, when I try to persist the RDD, I get the following error:
>>>>>
>>>>> ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
>>>>> TachyonFS.java[connect]:364)  - Invalid method name:
>>>>> 'getUserUnderfsTempFolder'
>>>>> ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
>>>>> TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'
>>>>>
>>>>> Is this because of a version mis-match?
>>>>>
>>>>> On a different note, I was wondering if Tachyon has been used in a
>>>>> production environment by anybody in this group?
>>>>>
>>>>> Appreciate your help with this.
>>>>>
>>>>>
>>>>> - Ranga
>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Haoyuan Li
> AMPLab, EECS, UC Berkeley
> http://www.cs.berkeley.edu/~haoyuan/
>


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
I just tested with Spark-1.3.0 + Tachyon-0.6.0 and still see the same
issue. Here are the logs:
15/03/18 11:44:11 ERROR : Invalid method name: 'getDataFolder'
15/03/18 11:44:11 ERROR : Invalid method name: 'user_getFileId'
15/03/18 11:44:11 ERROR storage.TachyonBlockManager: Failed 10 attempts to
create tachyon dir in
/tmp_spark_tachyon/spark-e3538a20-5e42-48a4-ad67-4b97aded90e4/

Thanks for any other pointers.


- Ranga



On Wed, Mar 18, 2015 at 9:53 AM, Ranga  wrote:

> Thanks for the information. Will rebuild with 0.6.0 till the patch is
> merged.
>
> On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu  wrote:
>
>> Ranga:
>> Take a look at https://github.com/apache/spark/pull/4867
>>
>> Cheers
>>
>> On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com 
>> wrote:
>>
>>> Hi, Ranga
>>>
>>> That's true. Typically a version mis-match issue. Note that spark 1.2.1
>>> has tachyon built in with version 0.5.0 , I think you may need to rebuild
>>> spark
>>> with your current tachyon release.
>>> We had used tachyon for several of our spark projects in a production
>>> environment.
>>>
>>> Thanks,
>>> Sun.
>>>
>>> --
>>> fightf...@163.com
>>>
>>>
>>> *From:* Ranga 
>>> *Date:* 2015-03-18 06:45
>>> *To:* user@spark.apache.org
>>> *Subject:* StorageLevel: OFF_HEAP
>>> Hi
>>>
>>> I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
>>> cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
>>> However, when I try to persist the RDD, I get the following error:
>>>
>>> ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
>>> TachyonFS.java[connect]:364)  - Invalid method name:
>>> 'getUserUnderfsTempFolder'
>>> ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
>>> TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'
>>>
>>> Is this because of a version mis-match?
>>>
>>> On a different note, I was wondering if Tachyon has been used in a
>>> production environment by anybody in this group?
>>>
>>> Appreciate your help with this.
>>>
>>>
>>> - Ranga
>>>
>>>
>>
>


Re: StorageLevel: OFF_HEAP

2015-03-18 Thread Ranga
Thanks for the information. Will rebuild with 0.6.0 till the patch is
merged.

On Tue, Mar 17, 2015 at 7:24 PM, Ted Yu  wrote:

> Ranga:
> Take a look at https://github.com/apache/spark/pull/4867
>
> Cheers
>
> On Tue, Mar 17, 2015 at 6:08 PM, fightf...@163.com 
> wrote:
>
>> Hi, Ranga
>>
>> That's true. Typically a version mis-match issue. Note that spark 1.2.1
>> has tachyon built in with version 0.5.0 , I think you may need to rebuild
>> spark
>> with your current tachyon release.
>> We had used tachyon for several of our spark projects in a production
>> environment.
>>
>> Thanks,
>> Sun.
>>
>> --
>> fightf...@163.com
>>
>>
>> *From:* Ranga 
>> *Date:* 2015-03-18 06:45
>> *To:* user@spark.apache.org
>> *Subject:* StorageLevel: OFF_HEAP
>> Hi
>>
>> I am trying to use the OFF_HEAP storage level in my Spark (1.2.1)
>> cluster. The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running.
>> However, when I try to persist the RDD, I get the following error:
>>
>> ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
>> TachyonFS.java[connect]:364)  - Invalid method name:
>> 'getUserUnderfsTempFolder'
>> ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
>> TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'
>>
>> Is this because of a version mis-match?
>>
>> On a different note, I was wondering if Tachyon has been used in a
>> production environment by anybody in this group?
>>
>> Appreciate your help with this.
>>
>>
>> - Ranga
>>
>>
>


StorageLevel: OFF_HEAP

2015-03-17 Thread Ranga
Hi

I am trying to use the OFF_HEAP storage level in my Spark (1.2.1) cluster.
The Tachyon (0.6.0-SNAPSHOT) nodes seem to be up and running. However, when
I try to persist the RDD, I get the following error:

ERROR [2015-03-16 22:22:54,017] ({Executor task launch worker-0}
TachyonFS.java[connect]:364)  - Invalid method name:
'getUserUnderfsTempFolder'
ERROR [2015-03-16 22:22:54,050] ({Executor task launch worker-0}
TachyonFS.java[getFileId]:1020)  - Invalid method name: 'user_getFileId'

Is this because of a version mis-match?

On a different note, I was wondering if Tachyon has been used in a
production environment by anybody in this group?

Appreciate your help with this.


- Ranga
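
For context, a minimal sketch of the persist call in question, assuming a Spark
1.2.x build whose Tachyon client matches the running Tachyon master (the version
pairing is exactly what the replies above are about); the master URL and data are
placeholders, and the config key is the one documented for that release line, as
far as I recall:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// In the Spark 1.x era, OFF_HEAP persistence is delegated to Tachyon, so the
// Spark build and the Tachyon cluster must speak compatible client protocols.
val conf = new SparkConf()
  .setAppName("OffHeapPersistExample")
  .set("spark.tachyonStore.url", "tachyon://localhost:19998") // placeholder master address
val sc = new SparkContext(conf)

val rdd = sc.parallelize(1 to 1000000)
rdd.persist(StorageLevel.OFF_HEAP) // blocks are written to Tachyon instead of the JVM heap
println(rdd.count())

The "Invalid method name" errors in the thread appear to be the symptom of that
client protocol mismatch rather than anything in the persist call itself.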


Re: Spark Streaming: HiveContext within Custom Actor

2014-12-30 Thread Ranga
Thanks. Will look at other options.

On Tue, Dec 30, 2014 at 11:43 AM, Tathagata Das  wrote:

> I am not sure that can be done. Receivers are designed to be run only
> on the executors/workers, whereas a SQLContext (for using Spark SQL)
> can only be defined on the driver.
>
>
> On Mon, Dec 29, 2014 at 6:45 PM, sranga  wrote:
> > Hi
> >
> > Could Spark-SQL be used from within a custom actor that acts as a
> receiver
> > for a streaming application? If yes, what is the recommended way of
> passing
> > the SparkContext to the actor?
> > Thanks for your help.
> >
> >
> > - Ranga
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-HiveContext-within-Custom-Actor-tp20892.html
> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> > For additional commands, e-mail: user-h...@spark.apache.org
> >
>
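
For reference, a hedged sketch of the usual workaround from that era, not code
from this thread: keep the HiveContext on the driver and run the SQL inside
foreachRDD (whose body executes on the driver), instead of inside a custom
receiver/actor on the executors. The stream source, table and query are
placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamingSqlExample")
val ssc = new StreamingContext(conf, Seconds(10))
val hiveContext = new HiveContext(ssc.sparkContext) // lives on the driver

val lines = ssc.socketTextStream("localhost", 9999) // placeholder source
lines.foreachRDD { rdd =>
  // foreachRDD bodies run on the driver, so the HiveContext is usable here.
  import hiveContext._
  val words = rdd.flatMap(_.split("\\s+")).map(w => Tuple1(w))
  words.registerAsTable("words") // Spark 1.1/1.2 SchemaRDD API
  sql("SELECT _1, COUNT(*) FROM words GROUP BY _1").collect().foreach(println)
}

ssc.start()
ssc.awaitTermination()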


Re: RDDs being cleaned too fast

2014-12-11 Thread Ranga
I was having similar issues with my persistent RDDs. After some digging
around, I noticed that the partitions were not balanced evenly across the
available nodes. After a "repartition", the RDD was spread evenly across
all available memory. Not sure if that is something that would help your
use-case though.
You could also increase the spark.storage.memoryFraction if that is an
option.
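
A minimal sketch of both suggestions, assuming a Spark 1.1-era job and a
placeholder input path; the property name is unchanged from the paragraph above
and its default in that release line is 0.6:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("CachingExample")
  .set("spark.storage.memoryFraction", "0.7") // raise only if shuffle/serialization headroom allows
val sc = new SparkContext(conf)

val raw = sc.textFile("hdfs:///some/input") // placeholder path
// Rebalance partitions across the cluster before caching so that no single
// executor's storage pool fills up and starts evicting blocks.
val balanced = raw.repartition(sc.defaultParallelism)
balanced.persist(StorageLevel.MEMORY_ONLY)
println(balanced.count())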


- Ranga

On Wed, Dec 10, 2014 at 10:23 PM, Aaron Davidson  wrote:

> The ContextCleaner uncaches RDDs that have gone out of scope on the
> driver. So it's possible that the given RDD is no longer reachable in your
> program's control flow, or else it'd be a bug in the ContextCleaner.
>
> On Wed, Dec 10, 2014 at 5:34 PM, ankits  wrote:
>
>> I'm using spark 1.1.0 and am seeing persisted RDDs being cleaned up too
>> fast.
>> How can i inspect the size of RDD in memory and get more information about
>> why it was cleaned up. There should be more than enough memory available
>> on
>> the cluster to store them, and by default, the spark.cleaner.ttl is
>> infinite, so I want more information about why this is happening and how
>> to
>> prevent it.
>>
>> Spark just logs this when removing RDDs:
>>
>> [2014-12-11 01:19:34,006] INFO  spark.storage.BlockManager [] [] -
>> Removing
>> RDD 33
>> [2014-12-11 01:19:34,010] INFO  pache.spark.ContextCleaner []
>> [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
>> [2014-12-11 01:19:34,012] INFO  spark.storage.BlockManager [] [] -
>> Removing
>> RDD 33
>> [2014-12-11 01:19:34,016] INFO  pache.spark.ContextCleaner []
>> [akka://JobServer/user/context-supervisor/job-context1] - Cleaned RDD 33
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/RDDs-being-cleaned-too-fast-tp20613.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks Rishi. That is exactly what I am trying to do now :)

On Tue, Oct 14, 2014 at 2:41 PM, Rishi Pidva  wrote:

>
> As per EMR documentation:
> http://docs.amazonaws.cn/en_us/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html
> Access AWS Resources Using IAM Roles
>
> If you've launched your cluster with an IAM role, applications running on
> the EC2 instances of that cluster can use the IAM role to obtain temporary
> account credentials to use when calling services in AWS.
>
> The version of Hadoop available on AMI 2.3.0 and later has already been
> updated to make use of IAM roles. If your application runs strictly on top
> of the Hadoop architecture, and does not directly call any service in AWS,
> it should work with IAM roles with no modification.
>
> If your application calls services in AWS directly, you'll need to update
> it to take advantage of IAM roles. This means that instead of obtaining
> account credentials from/home/hadoop/conf/core-site.xml on the EC2
> instances in the cluster, your application will now either use an SDK to
> access the resources using IAM roles, or call the EC2 instance metadata to
> obtain the temporary credentials.
> --
>
> Maybe you can use AWS SDK in your application to provide AWS credentials?
>
> https://github.com/seratch/AWScala
>
>
> On Oct 14, 2014, at 11:10 AM, Ranga  wrote:
>
> One related question. Could I specify the "
> com.amazonaws.services.s3.AmazonS3Client" implementation for the  "
> fs.s3.impl" parameter? Let me try that and update this thread with my
> findings.
>
> On Tue, Oct 14, 2014 at 10:48 AM, Ranga  wrote:
>
>> Thanks for the input.
>> Yes, I did use the "temporary" access credentials provided by the IAM
>> role (also detailed in the link you provided). The session token needs to
>> be specified and I was looking for a way to set that in the header (which
>> doesn't seem possible).
>> Looks like a static key/secret is the only option.
>>
>> On Tue, Oct 14, 2014 at 10:32 AM, Gen  wrote:
>>
>>> Hi,
>>>
>>> If I remember well, Spark cannot use the IAM role credentials to access
>>> S3. It uses the id/key in the environment first; if that is null, it uses
>>> the value in the core-site.xml file. So an IAM role is not useful for
>>> Spark. The same problem happens if you want to use the distcp command in
>>> Hadoop.
>>>
>>>
>>> Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
>>> "temporary" access? If yes, these credentials cannot be used directly by
>>> Spark; for more information, you can take a look at
>>> http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html
>>>
>>>
>>>
>>> sranga wrote
>>> > Thanks for the pointers.
>>> > I verified that the access key-id/secret used are valid. However, the
>>> > secret may contain "/" at times. The issues I am facing are as follows:
>>> >
>>> >- The EC2 instances are setup with an IAMRole () and don't have a
>>> > static
>>> >    key-id/secret
>>> >- All of the EC2 instances have access to S3 based on this role (I
>>> used
>>> >s3ls and s3cp commands to verify this)
>>> >- I can get a "temporary" access key-id/secret based on the IAMRole
>>> but
>>> >they generally expire in an hour
>>> >- If Spark is not able to use the IAMRole credentials, I may have to
>>> >generate a static key-id/secret. This may or may not be possible in
>>> the
>>> >environment I am in (from a policy perspective)
>>> >
>>> >
>>> >
>>> > - Ranga
>>> >
>>> > On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny < mag@ > wrote:
>>> >
>>> >> Hi,
>>> >> keep in mind that you're going to have a bad time if your secret key
>>> >> contains a "/"
>>> >> This is due to old and stupid hadoop bug:
>>> >> https://issues.apache.org/jira/browse/HADOOP-3733
>>> >>
>>> >> Best way is to regenerate the key so it does not include a "/"
>>> >>
>>> >> /Raf
>>> >>
>>> >>

Re: S3 Bucket Access

2014-10-14 Thread Ranga
One related question. Could I specify the "
com.amazonaws.services.s3.AmazonS3Client" implementation for the  "
fs.s3.impl" parameter? Let me try that and update this thread with my
findings.

On Tue, Oct 14, 2014 at 10:48 AM, Ranga  wrote:

> Thanks for the input.
> Yes, I did use the "temporary" access credentials provided by the IAM role
> (also detailed in the link you provided). The session token needs to be
> specified and I was looking for a way to set that in the header (which
> doesn't seem possible).
> Looks like a static key/secret is the only option.
>
> On Tue, Oct 14, 2014 at 10:32 AM, Gen  wrote:
>
>> Hi,
>>
>> If I remember well, Spark cannot use the IAM role credentials to access
>> S3. It uses the id/key in the environment first; if that is null, it uses
>> the value in the core-site.xml file. So an IAM role is not useful for
>> Spark. The same problem happens if you want to use the distcp command in
>> Hadoop.
>>
>>
>> Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
>> "temporary" access? If yes, these credentials cannot be used directly by
>> Spark; for more information, you can take a look at
>> http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html
>>
>>
>>
>> sranga wrote
>> > Thanks for the pointers.
>> > I verified that the access key-id/secret used are valid. However, the
>> > secret may contain "/" at times. The issues I am facing are as follows:
>> >
>> >- The EC2 instances are setup with an IAMRole () and don't have a
>> > static
>> >key-id/secret
>> >- All of the EC2 instances have access to S3 based on this role (I
>> used
>> >s3ls and s3cp commands to verify this)
>> >- I can get a "temporary" access key-id/secret based on the IAMRole
>> but
>> >they generally expire in an hour
>> >- If Spark is not able to use the IAMRole credentials, I may have to
>> >generate a static key-id/secret. This may or may not be possible in
>> the
>> >environment I am in (from a policy perspective)
>> >
>> >
>> >
>> > - Ranga
>> >
>> > On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny < mag@ > wrote:
>> >
>> >> Hi,
>> >> keep in mind that you're going to have a bad time if your secret key
>> >> contains a "/"
>> >> This is due to old and stupid hadoop bug:
>> >> https://issues.apache.org/jira/browse/HADOOP-3733
>> >>
>> >> Best way is to regenerate the key so it does not include a "/"
>> >>
>> >> /Raf
>> >>
>> >>
>> >> Akhil Das wrote:
>> >>
>> >> Try the following:
>> >>
>> >> 1. Set the access key and secret key in the sparkContext:
>> >>
>> >>> sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey)
>> >>> sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)
>> >>
>> >> 2. Set the access key and secret key in the environment before starting
>> >> your application:
>> >>
>> >>> export AWS_ACCESS_KEY_ID=
>> >>> export AWS_SECRET_ACCESS_KEY=
>> >>
>> >> 3. Set the access key and secret key inside the hadoop configurations
>> >>
>> >>> val hadoopConf = sparkContext.hadoopConfiguration
>> >>> hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>> >>> hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)
>> >>> hadoopConf.set("fs.s3.awsSecretAccessKey", yourSecretKey)
>> >>
>> >> 4. You can also try:
>> >>
>> >>> val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@
>> >>> /path/")

Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the input.
Yes, I did use the "temporary" access credentials provided by the IAM role
(also detailed in the link you provided). The session token needs to be
specified and I was looking for a way to set that in the header (which
doesn't seem possible).
Looks like a static key/secret is the only option.
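
For reference, a minimal sketch of that static-key fallback, assuming an existing
SparkContext named sc; the property names are the s3n ones used by the Hadoop
connector of that generation, and the key values and bucket path are placeholders:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3n.awsAccessKeyId", "<static-access-key-id>")
hadoopConf.set("fs.s3n.awsSecretAccessKey", "<static-secret-access-key>")

val lines = sc.textFile("s3n://some-bucket/path/to/data")
println(lines.count())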

On Tue, Oct 14, 2014 at 10:32 AM, Gen  wrote:

> Hi,
>
> If I remember well, Spark cannot use the IAM role credentials to access
> S3. It uses the id/key in the environment first; if that is null, it uses
> the value in the core-site.xml file. So an IAM role is not useful for
> Spark. The same problem happens if you want to use the distcp command in
> Hadoop.
>
>
> Do you use curl http://169.254.169.254/latest/meta-data/iam/... to get the
> "temporary" access? If yes, these credentials cannot be used directly by
> Spark; for more information, you can take a look at
> http://docs.aws.amazon.com/STS/latest/UsingSTS/using-temp-creds.html
>
>
>
> sranga wrote
> > Thanks for the pointers.
> > I verified that the access key-id/secret used are valid. However, the
> > secret may contain "/" at times. The issues I am facing are as follows:
> >
> >- The EC2 instances are setup with an IAMRole () and don't have a
> > static
> >key-id/secret
> >- All of the EC2 instances have access to S3 based on this role (I
> used
> >s3ls and s3cp commands to verify this)
> >- I can get a "temporary" access key-id/secret based on the IAMRole
> but
> >they generally expire in an hour
> >- If Spark is not able to use the IAMRole credentials, I may have to
> >generate a static key-id/secret. This may or may not be possible in
> the
> >environment I am in (from a policy perspective)
> >
> >
> >
> > - Ranga
> >
> > On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny < mag@ > wrote:
> >
> >> Hi,
> >> keep in mind that you're going to have a bad time if your secret key
> >> contains a "/"
> >> This is due to old and stupid hadoop bug:
> >> https://issues.apache.org/jira/browse/HADOOP-3733
> >>
> >> Best way is to regenerate the key so it does not include a "/"
> >>
> >> /Raf
> >>
> >>
> >> Akhil Das wrote:
> >>
> >> Try the following:
> >>
> >> 1. Set the access key and secret key in the sparkContext:
> >>
> >>> sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey)
> >>> sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)
> >>
> >> 2. Set the access key and secret key in the environment before starting
> >> your application:
> >>
> >>> export AWS_ACCESS_KEY_ID=
> >>> export AWS_SECRET_ACCESS_KEY=
> >>
> >> 3. Set the access key and secret key inside the hadoop configurations
> >>
> >>> val hadoopConf = sparkContext.hadoopConfiguration
> >>> hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
> >>> hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)
> >>> hadoopConf.set("fs.s3.awsSecretAccessKey", yourSecretKey)
> >>
> >> 4. You can also try:
> >>
> >>> val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@
> >>> /path/")
> >>
> >>
> >> Thanks
> >> Best Regards
> >>
> >> On Mon, Oct 13, 2014 at 11:33 PM, Ranga < sranga@ > wrote:
> >>
> >>> Hi
> >>>
> >>> I am trying to access files/buckets in S3 and encountering a
> permissions
> >>> issue. The buckets are configured to authenticate using an IAMRole
> >>> provider.
> >>> I have set the KeyId and Secret using environment variables (
> >>> AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
> unable
> >>> to access the S3 buckets.
> >>>
> >>> Before setting the access key and secret the error was:
> >>> "java.lang.IllegalArgumentException:
> >>> AWS Access Key ID and Secret Access Key must be specified as the
> >>> username
> >>> or password (respectively) of a s3n URL, or by setting the
> >>> fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
> >>> (respectively)."
> >>>
> >>> After setting the access key and secret, the error is: "The AWS Access
> >>> Key Id you provided does not exist in our records."
> >>>
> >>> The id/secret being set are the right values. This makes me believe
> that
> >>> something else ("token", etc.) needs to be set as well.
> >>> Any help is appreciated.
> >>>
> >>>
> >>> - Ranga
> >>>
> >>
> >>
> >>
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/S3-Bucket-Access-tp16303p16397.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: S3 Bucket Access

2014-10-14 Thread Ranga
Thanks for the pointers.
I verified that the access key-id/secret used are valid. However, the
secret may contain "/" at times. The issues I am facing are as follows:

   - The EC2 instances are setup with an IAMRole () and don't have a static
   key-id/secret
   - All of the EC2 instances have access to S3 based on this role (I used
   s3ls and s3cp commands to verify this)
   - I can get a "temporary" access key-id/secret based on the IAMRole, but
   they generally expire in an hour (see the sketch after this list)
   - If Spark is not able to use the IAMRole credentials, I may have to
   generate a static key-id/secret. This may or may not be possible in the
   environment I am in (from a policy perspective)
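
A sketch of where those expiring credentials come from, for anyone following
along: the EC2 instance metadata service (the classic IMDSv1-style endpoints,
the same ones behind the curl command Gen mentions elsewhere in this thread;
the role name is whatever the instance profile defines):

import scala.io.Source

// List the role attached to the instance, then fetch its temporary credentials.
val base = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
val role = Source.fromURL(base).mkString.trim
val json = Source.fromURL(base + role).mkString
// The JSON carries AccessKeyId, SecretAccessKey, Token and Expiration; the
// session Token is the part the s3n connector of that era has no property for.
println(json)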



- Ranga

On Tue, Oct 14, 2014 at 4:21 AM, Rafal Kwasny  wrote:

> Hi,
> keep in mind that you're going to have a bad time if your secret key
> contains a "/"
> This is due to old and stupid hadoop bug:
> https://issues.apache.org/jira/browse/HADOOP-3733
>
> Best way is to regenerate the key so it does not include a "/"
>
> /Raf
>
>
> Akhil Das wrote:
>
> Try the following:
>
> 1. Set the access key and secret key in the sparkContext:
>
>> sparkContext.set("AWS_ACCESS_KEY_ID", yourAccessKey)
>> sparkContext.set("AWS_SECRET_ACCESS_KEY", yourSecretKey)
>
> 2. Set the access key and secret key in the environment before starting
> your application:
>
>> export AWS_ACCESS_KEY_ID=
>> export AWS_SECRET_ACCESS_KEY=
>
> 3. Set the access key and secret key inside the hadoop configurations
>
>> val hadoopConf = sparkContext.hadoopConfiguration
>> hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
>> hadoopConf.set("fs.s3.awsAccessKeyId", yourAccessKey)
>> hadoopConf.set("fs.s3.awsSecretAccessKey", yourSecretKey)
>
> 4. You can also try:
>
>> val lines = sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@
>> /path/")
>
>
> Thanks
> Best Regards
>
> On Mon, Oct 13, 2014 at 11:33 PM, Ranga  wrote:
>
>> Hi
>>
>> I am trying to access files/buckets in S3 and encountering a permissions
>> issue. The buckets are configured to authenticate using an IAMRole provider.
>> I have set the KeyId and Secret using environment variables (
>> AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable
>> to access the S3 buckets.
>>
>> Before setting the access key and secret the error was: 
>> "java.lang.IllegalArgumentException:
>> AWS Access Key ID and Secret Access Key must be specified as the username
>> or password (respectively) of a s3n URL, or by setting the
>> fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
>> (respectively)."
>>
>> After setting the access key and secret, the error is: "The AWS Access
>> Key Id you provided does not exist in our records."
>>
>> The id/secret being set are the right values. This makes me believe that
>> something else ("token", etc.) needs to be set as well.
>> Any help is appreciated.
>>
>>
>> - Ranga
>>
>
>
>


Re: S3 Bucket Access

2014-10-13 Thread Ranga
Hi Daniil

Could you provide some more details on how the cluster should be
launched/configured? The EC2 instance that I am dealing with uses the
concept of IAMRoles. I don't have any "keyfile" to specify to the spark-ec2
script.
Thanks for your help.


- Ranga

On Mon, Oct 13, 2014 at 3:04 PM, Daniil Osipov 
wrote:

> (Copying the user list)
> You should use spark_ec2 script to configure the cluster. If you use trunk
> version you can use the new --copy-aws-credentials option to configure the
> S3 parameters automatically, otherwise either include them in your
> SparkConf variable or add them to
> /root/spark/ephemeral-hdfs/conf/core-site.xml
>
> On Mon, Oct 13, 2014 at 2:56 PM, Ranga  wrote:
>
>> The cluster is deployed on EC2 and I am trying to access the S3 files
>> from within a spark-shell session.
>>
>> On Mon, Oct 13, 2014 at 2:51 PM, Daniil Osipov 
>> wrote:
>>
>>> So is your cluster running on EC2, or locally? If you're running
>>> locally, you should still be able to access S3 files, you just need to
>>> locate the core-site.xml and add the parameters as defined in the error.
>>>
>>> On Mon, Oct 13, 2014 at 2:49 PM, Ranga  wrote:
>>>
>>>> Hi Daniil
>>>>
>>>> No. I didn't create the spark-cluster using the ec2 scripts. Is that
>>>> something that I need to do? I just downloaded Spark-1.1.0 and Hadoop-2.4.
>>>> However, I am trying to access files on S3 from this cluster.
>>>>
>>>>
>>>> - Ranga
>>>>
>>>> On Mon, Oct 13, 2014 at 2:36 PM, Daniil Osipov <
>>>> daniil.osi...@shazam.com> wrote:
>>>>
>>>>> Did you add the fs.s3n.aws* configuration parameters in
>>>>> /root/spark/ephemeral-hdfs/conf/core-ste.xml?
>>>>>
>>>>> On Mon, Oct 13, 2014 at 11:03 AM, Ranga  wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am trying to access files/buckets in S3 and encountering a
>>>>>> permissions issue. The buckets are configured to authenticate using an
>>>>>> IAMRole provider.
>>>>>> I have set the KeyId and Secret using environment variables (
>>>>>> AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still
>>>>>> unable to access the S3 buckets.
>>>>>>
>>>>>> Before setting the access key and secret the error was: 
>>>>>> "java.lang.IllegalArgumentException:
>>>>>> AWS Access Key ID and Secret Access Key must be specified as the username
>>>>>> or password (respectively) of a s3n URL, or by setting the
>>>>>> fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
>>>>>> (respectively)."
>>>>>>
>>>>>> After setting the access key and secret, the error is: "The AWS
>>>>>> Access Key Id you provided does not exist in our records."
>>>>>>
>>>>>> The id/secret being set are the right values. This makes me believe
>>>>>> that something else ("token", etc.) needs to be set as well.
>>>>>> Any help is appreciated.
>>>>>>
>>>>>>
>>>>>> - Ranga
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
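
For reference, a hedged sketch of the "include them in your SparkConf variable"
option Daniil mentions above; if I recall correctly, the spark.hadoop.* prefix
forwards properties into the Hadoop configuration, the key names are the s3n
ones, and the bucket path is a placeholder:

import org.apache.spark.{SparkConf, SparkContext}

// Forward the S3 credentials through SparkConf instead of editing
// /root/spark/ephemeral-hdfs/conf/core-site.xml on the master.
val conf = new SparkConf()
  .setAppName("S3AccessExample")
  .set("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env.getOrElse("AWS_ACCESS_KEY_ID", ""))
  .set("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env.getOrElse("AWS_SECRET_ACCESS_KEY", ""))
val sc = new SparkContext(conf)

val lines = sc.textFile("s3n://some-bucket/path/to/data")
println(lines.count())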


Re: S3 Bucket Access

2014-10-13 Thread Ranga
Is there a way to specify a request header during the
.textFile call?


- Ranga

On Mon, Oct 13, 2014 at 11:03 AM, Ranga  wrote:

> Hi
>
> I am trying to access files/buckets in S3 and encountering a permissions
> issue. The buckets are configured to authenticate using an IAMRole provider.
> I have set the KeyId and Secret using environment variables (
> AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable
> to access the S3 buckets.
>
> Before setting the access key and secret the error was: 
> "java.lang.IllegalArgumentException:
> AWS Access Key ID and Secret Access Key must be specified as the username
> or password (respectively) of a s3n URL, or by setting the
> fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
> (respectively)."
>
> After setting the access key and secret, the error is: "The AWS Access
> Key Id you provided does not exist in our records."
>
> The id/secret being set are the right values. This makes me believe that
> something else ("token", etc.) needs to be set as well.
> Any help is appreciated.
>
>
> - Ranga
>


S3 Bucket Access

2014-10-13 Thread Ranga
Hi

I am trying to access files/buckets in S3 and encountering a permissions
issue. The buckets are configured to authenticate using an IAMRole provider.
I have set the KeyId and Secret using environment variables (
AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID). However, I am still unable to
access the S3 buckets.

Before setting the access key and secret the error was:
"java.lang.IllegalArgumentException:
AWS Access Key ID and Secret Access Key must be specified as the username
or password (respectively) of a s3n URL, or by setting the
fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties
(respectively)."

After setting the access key and secret, the error is: "The AWS Access Key
Id you provided does not exist in our records."

The id/secret being set are the right values. This makes me believe that
something else ("token", etc.) needs to be set as well.
Any help is appreciated.


- Ranga


Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-09 Thread Ranga
Resolution:
After realizing that the SerDe (OpenCSV) was causing all the fields to be
defined as "String" type, I modified the Hive "load" statement to use the
default serializer. I was able to modify the CSV input file to use a
different delimiter. Although, this is a workaround, I am able to proceed
with this for now.
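
For reference, a hypothetical sketch of the other approach Ranga mentions further
down in this thread (mapping the all-String rows onto a custom class so the
columns carry real types before registerAsTable); Spark 1.1 SchemaRDD API,
placeholder table and column names, and an existing SparkContext named sc:

import org.apache.spark.sql.hive.HiveContext

case class Sale(c1: String, c2: String, c3: Int)

val hiveContext = new HiveContext(sc)
import hiveContext._

// The CSV-backed Hive table reports every column as String, so convert here.
val typed = sql("SELECT c1, c2, c3 FROM source_hive_table")
  .map(r => Sale(r(0).toString, r(1).toString, r(2).toString.trim.toInt))
typed.registerAsTable("sourceRDD")

val agg = sql("SELECT c1, c2, SUM(c3) FROM sourceRDD GROUP BY c1, c2")
agg.collect().foreach(println)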


- Ranga

On Wed, Oct 8, 2014 at 9:18 PM, Ranga  wrote:

> This is a bit strange. When I print the schema for the RDD, it reflects
> the correct data type for each column. But doing any kind of mathematical
> calculation seems to result in ClassCastException. Here is a sample that
> results in the exception:
> select c1, c2
> ...
> cast (c18 as int) * cast (c21 as int)
> ...
> from table
>
> Any other pointers? Thanks for the help.
>
>
> - Ranga
>
> On Wed, Oct 8, 2014 at 5:20 PM, Ranga  wrote:
>
>> Sorry. It's 1.1.0.
>> After digging a bit more into this, it seems like the OpenCSV Deserializer
>> converts all the columns to a String type. This may be throwing the
>> execution off. Planning to create a class and map the rows to this custom
>> class. Will keep this thread updated.
>>
>> On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust 
>> wrote:
>>
>>> Which version of Spark are you running?
>>>
>>> On Wed, Oct 8, 2014 at 4:18 PM, Ranga  wrote:
>>>
>>>> Thanks Michael. Should the cast be done in the source RDD or while
>>>> doing the SUM?
>>>> To give a better picture here is the code sequence:
>>>>
>>>> val sourceRdd = sql("select ... from source-hive-table")
>>>> sourceRdd.registerAsTable("sourceRDD")
>>>> val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1,
>>>> c2)  // This query throws the exception when I collect the results
>>>>
>>>> I tried adding the cast to the aggRdd query above and that didn't help.
>>>>
>>>>
>>>> - Ranga
>>>>
>>>> On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> Using SUM on a string should automatically cast the column.  Also you
>>>>> can use CAST to change the datatype
>>>>> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions>
>>>>> .
>>>>>
>>>>> What version of Spark are you running?  This could be
>>>>> https://issues.apache.org/jira/browse/SPARK-1994
>>>>>
>>>>> On Wed, Oct 8, 2014 at 3:47 PM, Ranga  wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I am in the process of migrating some logic in pig scripts to
>>>>>> Spark-SQL. As part of this process, I am creating a few "Select...Group 
>>>>>> By"
>>>>>> query and registering them as tables using the SchemaRDD.registerAsTable
>>>>>> feature.
>>>>>> When using such a registered table in a subsequent "Select...Group
>>>>>> By" query, I get a "ClassCastException".
>>>>>> java.lang.ClassCastException: java.lang.String cannot be cast to
>>>>>> java.lang.Integer
>>>>>>
>>>>>> This happens when I use the "Sum" function on one of the columns. Is
>>>>>> there anyway to specify the data type for the columns when the
>>>>>> registerAsTable function is called? Are there other approaches that I
>>>>>> should be looking at?
>>>>>>
>>>>>> Thanks for your help.
>>>>>>
>>>>>>
>>>>>>
>>>>>> - Ranga
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
This is a bit strange. When I print the schema for the RDD, it reflects the
correct data type for each column. But doing any kind of mathematical
calculation seems to result in ClassCastException. Here is a sample that
results in the exception:
select c1, c2
...
cast (c18 as int) * cast (c21 as int)
...
from table

Any other pointers? Thanks for the help.


- Ranga

On Wed, Oct 8, 2014 at 5:20 PM, Ranga  wrote:

> Sorry. It's 1.1.0.
> After digging a bit more into this, it seems like the OpenCSV Deserializer
> converts all the columns to a String type. This may be throwing the
> execution off. Planning to create a class and map the rows to this custom
> class. Will keep this thread updated.
>
> On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust 
> wrote:
>
>> Which version of Spark are you running?
>>
>> On Wed, Oct 8, 2014 at 4:18 PM, Ranga  wrote:
>>
>>> Thanks Michael. Should the cast be done in the source RDD or while doing
>>> the SUM?
>>> To give a better picture here is the code sequence:
>>>
>>> val sourceRdd = sql("select ... from source-hive-table")
>>> sourceRdd.registerAsTable("sourceRDD")
>>> val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2)
>>>  // This query throws the exception when I collect the results
>>>
>>> I tried adding the cast to the aggRdd query above and that didn't help.
>>>
>>>
>>> - Ranga
>>>
>>> On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust wrote:
>>>
>>>> Using SUM on a string should automatically cast the column.  Also you
>>>> can use CAST to change the datatype
>>>> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions>
>>>> .
>>>>
>>>> What version of Spark are you running?  This could be
>>>> https://issues.apache.org/jira/browse/SPARK-1994
>>>>
>>>> On Wed, Oct 8, 2014 at 3:47 PM, Ranga  wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> I am in the process of migrating some logic in pig scripts to
>>>>> Spark-SQL. As part of this process, I am creating a few "Select...Group 
>>>>> By"
>>>>> query and registering them as tables using the SchemaRDD.registerAsTable
>>>>> feature.
>>>>> When using such a registered table in a subsequent "Select...Group By"
>>>>> query, I get a "ClassCastException".
>>>>> java.lang.ClassCastException: java.lang.String cannot be cast to
>>>>> java.lang.Integer
>>>>>
>>>>> This happens when I use the "Sum" function on one of the columns. Is
>>>>> there anyway to specify the data type for the columns when the
>>>>> registerAsTable function is called? Are there other approaches that I
>>>>> should be looking at?
>>>>>
>>>>> Thanks for your help.
>>>>>
>>>>>
>>>>>
>>>>> - Ranga
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Sorry. It's 1.1.0.
After digging a bit more into this, it seems like the OpenCSV Deserializer
converts all the columns to a String type. This may be throwing the
execution off. Planning to create a class and map the rows to this custom
class. Will keep this thread updated.

On Wed, Oct 8, 2014 at 5:11 PM, Michael Armbrust 
wrote:

> Which version of Spark are you running?
>
> On Wed, Oct 8, 2014 at 4:18 PM, Ranga  wrote:
>
>> Thanks Michael. Should the cast be done in the source RDD or while doing
>> the SUM?
>> To give a better picture here is the code sequence:
>>
>> val sourceRdd = sql("select ... from source-hive-table")
>> sourceRdd.registerAsTable("sourceRDD")
>> val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2)
>>  // This query throws the exception when I collect the results
>>
>> I tried adding the cast to the aggRdd query above and that didn't help.
>>
>>
>> - Ranga
>>
>> On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust 
>> wrote:
>>
>>> Using SUM on a string should automatically cast the column.  Also you
>>> can use CAST to change the datatype
>>> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions>
>>> .
>>>
>>> What version of Spark are you running?  This could be
>>> https://issues.apache.org/jira/browse/SPARK-1994
>>>
>>> On Wed, Oct 8, 2014 at 3:47 PM, Ranga  wrote:
>>>
>>>> Hi
>>>>
>>>> I am in the process of migrating some logic in pig scripts to
>>>> Spark-SQL. As part of this process, I am creating a few "Select...Group By"
>>>> query and registering them as tables using the SchemaRDD.registerAsTable
>>>> feature.
>>>> When using such a registered table in a subsequent "Select...Group By"
>>>> query, I get a "ClassCastException".
>>>> java.lang.ClassCastException: java.lang.String cannot be cast to
>>>> java.lang.Integer
>>>>
>>>> This happens when I use the "Sum" function on one of the columns. Is
>>>> there anyway to specify the data type for the columns when the
>>>> registerAsTable function is called? Are there other approaches that I
>>>> should be looking at?
>>>>
>>>> Thanks for your help.
>>>>
>>>>
>>>>
>>>> - Ranga
>>>>
>>>
>>>
>>
>


Re: Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Thanks Michael. Should the cast be done in the source RDD or while doing
the SUM?
To give a better picture here is the code sequence:

val sourceRdd = sql("select ... from source-hive-table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("select c1, c2, sum(c3) from sourceRDD group by c1, c2)
 // This query throws the exception when I collect the results

I tried adding the cast to the aggRdd query above and that didn't help.
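
For what it's worth, a hedged variant of that sequence with the cast pushed down
into the source query so that c3 is already numeric before the aggregation (same
spark-shell style and placeholder names as the snippet above):

val sourceRdd = sql("SELECT c1, c2, CAST(c3 AS INT) AS c3 FROM source_hive_table")
sourceRdd.registerAsTable("sourceRDD")
val aggRdd = sql("SELECT c1, c2, SUM(c3) FROM sourceRDD GROUP BY c1, c2")
aggRdd.collect().foreach(println)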


- Ranga

On Wed, Oct 8, 2014 at 3:52 PM, Michael Armbrust 
wrote:

> Using SUM on a string should automatically cast the column.  Also you can
> use CAST to change the datatype
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-TypeConversionFunctions>
> .
>
> What version of Spark are you running?  This could be
> https://issues.apache.org/jira/browse/SPARK-1994
>
> On Wed, Oct 8, 2014 at 3:47 PM, Ranga  wrote:
>
>> Hi
>>
>> I am in the process of migrating some logic in pig scripts to Spark-SQL.
>> As part of this process, I am creating a few "Select...Group By" query and
>> registering them as tables using the SchemaRDD.registerAsTable feature.
>> When using such a registered table in a subsequent "Select...Group By"
>> query, I get a "ClassCastException".
>> java.lang.ClassCastException: java.lang.String cannot be cast to
>> java.lang.Integer
>>
>> This happens when I use the "Sum" function on one of the columns. Is
>> there anyway to specify the data type for the columns when the
>> registerAsTable function is called? Are there other approaches that I
>> should be looking at?
>>
>> Thanks for your help.
>>
>>
>>
>> - Ranga
>>
>
>


Spark-SQL: SchemaRDD - ClassCastException

2014-10-08 Thread Ranga
Hi

I am in the process of migrating some logic in pig scripts to Spark-SQL. As
part of this process, I am creating a few "Select...Group By" query and
registering them as tables using the SchemaRDD.registerAsTable feature.
When using such a registered table in a subsequent "Select...Group By"
query, I get a "ClassCastException".
java.lang.ClassCastException: java.lang.String cannot be cast to
java.lang.Integer

This happens when I use the "Sum" function on one of the columns. Is there
anyway to specify the data type for the columns when the registerAsTable
function is called? Are there other approaches that I should be looking at?

Thanks for your help.



- Ranga
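
To close the loop on the question above: registerAsTable itself takes no schema
argument, but, if I remember the Spark 1.1 API correctly, the types can be pinned
explicitly by building the rows yourself and calling applySchema with a
StructType before registering. A hedged sketch with placeholder names, assuming
an existing SparkContext sc and a SQLContext/HiveContext named sqlContext:

import org.apache.spark.sql._

// Declare the column types explicitly instead of relying on what the SerDe reports.
val schema = StructType(Seq(
  StructField("c1", StringType, true),
  StructField("c2", StringType, true),
  StructField("c3", IntegerType, true)))

// Rows built from a CSV file; the path and delimiter are placeholders.
val rowRdd = sc.textFile("hdfs:///path/to/data.csv")
  .map(_.split(","))
  .map(p => Row(p(0), p(1), p(2).trim.toInt))

val typedRdd = sqlContext.applySchema(rowRdd, schema)
typedRdd.registerAsTable("typedTable")
sqlContext.sql("SELECT c1, c2, SUM(c3) FROM typedTable GROUP BY c1, c2")
  .collect().foreach(println)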