Re: [PSA] Python 2, 3.4 and 3.5 are now dropped

2020-07-13 Thread Hyukjin Kwon
cc user mailing list too.

2020년 7월 14일 (화) 오전 11:27, Hyukjin Kwon 님이 작성:

> I am sending another email to make sure dev people know. Python 2, 3.4 and
> 3.5 are now dropped at https://github.com/apache/spark/pull/28957.
>
>
>


PySpark documentation main page

2020-08-01 Thread Hyukjin Kwon
Hi all,

I am trying to write up the main page of PySpark documentation at
https://github.com/apache/spark/pull/29320.

While I think the current proposal might be good enough, I would like
to collect more feedback about the contents, structure and image since
this is the entrance page of PySpark documentation.

For example, sharing a reference site is also very welcome. Let me know
if any of you have a good idea to share. I plan to leave it open for a few
more days.

PS: thanks @Liang-Chi Hsieh  and @Sean Owen
 for taking a look at it quickly.


Re: Apache Spark 3.1 Preparation Status (Oct. 2020)

2020-10-03 Thread Hyukjin Kwon
Nice summary. Thanks Dongjoon. One minor correction: I believe we dropped
R versions below 3.5 at branch 2.4 as well.

On Sun, 4 Oct 2020, 09:17 Dongjoon Hyun,  wrote:

> Hi, All.
>
> As of today, the master branch (Apache Spark 3.1.0) has resolved
> 852+ JIRA issues, and 606+ of them are 3.1.0-only patches.
> According to the 3.1.0 release window, branch-3.1 will be
> created on November 1st and enter the QA period.
>
> Here are some notable updates I've been monitoring.
>
> *Language*
> 01. SPARK-25075 Support Scala 2.13
>   - Since SPARK-32926, Scala 2.13 build test has
> become a part of GitHub Action jobs.
>   - After SPARK-33044, Scala 2.13 test will be
> a part of Jenkins jobs.
> 02. SPARK-29909 Drop Python 2 and Python 3.4 and 3.5
> 03. SPARK-32082 Project Zen: Improving Python usability
>   - 7 of 16 issues are resolved.
> 04. SPARK-32073 Drop R < 3.5 support
>   - This is done for Spark 3.0.1 and 3.1.0.
>
> *Dependency*
> 05. SPARK-32058 Use Apache Hadoop 3.2.0 dependency
>   - This changes the default dist. for better cloud support
> 06. SPARK-32981 Remove hive-1.2 distribution
> 07. SPARK-20202 Remove references to org.spark-project.hive
>   - This will remove Hive 1.2.1 from source code
> 08. SPARK-29250 Upgrade to Hadoop 3.2.1 (WIP)
>
> *Core*
> 09. SPARK-27495 Support Stage level resource conf and scheduling
>   - 11 of 15 issues are resolved
> 10. SPARK-25299 Use remote storage for persisting shuffle data
>   - 8 of 14 issues are resolved
>
> *Resource Manager*
> 11. SPARK-33005 Kubernetes GA preparation
>   - It is on the way and we are waiting for more feedback.
>
> *SQL*
> 12. SPARK-30648/SPARK-32346 Support filters pushdown
>   to JSON/Avro
> 13. SPARK-32948/SPARK-32958 Add Json expression optimizer
> 14. SPARK-12312 Support JDBC Kerberos w/ keytab
>   - 11 of 17 issues are resolved
> 15. SPARK-27589 DSv2 was mostly completed in 3.0
>   and added more features in 3.1, but the following gaps remain:
>   - All built-in DataSource v2 write paths are disabled
> and the v1 write path is used instead.
>   - Support partition pruning with subqueries
>   - Support bucketing
>
> We still have one month before the feature freeze
> and the start of QA. If you are working on 3.1,
> please consider the timeline and share your schedule
> with the Apache Spark community. Other work
> can go into the 3.2 release, scheduled for June 2021.
>
> Last but not least, I want to emphasize (7) once again.
> We need to remove the forked unofficial Hive eventually.
> Please let us know your reasons if you need to build
> from Apache Spark 3.1 source code for Hive 1.2.
>
> https://github.com/apache/spark/pull/29936
>
> As I wrote in the above PR description, for old releases,
> Apache Spark 2.4 (LTS) and 3.0 (~2021.12) will provide
> a Hive 1.2-based distribution.
>
> Bests,
> Dongjoon.
>


Re: [SparkR] gapply with strings with arrow

2020-10-10 Thread Hyukjin Kwon
If it works without Arrow optimization, it's likely a bug. Please feel free
to file a JIRA for that.

On Wed, 7 Oct 2020, 22:44 Jacek Pliszka,  wrote:

> Hi!
>
> Is there any place I can find information how to use gapply with arrow?
>
> I've tried something very simple
>
> collect(gapply(
>   df,
>   c("ColumnA"),
>   function(key, x){
>   data.frame(out=c("dfs"), stringsAsFactors=FALSE)
>   },
>   "out String"
> ))
>
> But it fails - similar code with integers or double works fine.
>
> [Fetched stdout timeout] Error in readBin(con, raw(),
> as.integer(dataLen), endian = "big") : invalid 'n' argument
>
> java.lang.UnsupportedOperationException at
>
> org.apache.spark.sql.vectorized.ArrowColumnVector$ArrowVectorAccessor.getUTF8String(ArrowColumnVector.java:233)
> at
> org.apache.spark.sql.vectorized.ArrowColumnVector.getUTF8String(ArrowColumnVector.java:109)
> at
> org.apache.spark.sql.vectorized.ColumnarBatchRow.getUTF8String(ColumnarBatch.java:220)
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
> Source)
>  ...
>
> When I looked at the source code there - it is all stubs.
>
> Is there a proper way to use arrow in gapply in SparkR?
>
> BR,
>
> Jacek
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Hyukjin Kwon
We are excited to announce Spark 3.1.1 today.

Apache Spark 3.1.1 is the second release of the 3.x line. This release adds
Python type annotations and Python dependency management support as part of
Project Zen.
Other major updates include improved ANSI SQL compliance support, history
server support
in structured streaming, the general availability (GA) of Kubernetes and
node decommissioning
in Kubernetes and Standalone. In addition, this release continues to focus
on usability, stability,
and polish while resolving around 1500 tickets.

We'd like to thank our contributors and users for their contributions and
early feedback to
this release. This release would not have been possible without you.

To download Spark 3.1.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-1.html


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-03 Thread Hyukjin Kwon
Thank you so much guys .. it indeed took a long time and it was pretty
tough this time :-).
It was all possible because of your guys' support. I sincerely appreciate
it 👍.

2021년 3월 4일 (목) 오전 2:26, Dongjoon Hyun 님이 작성:

> It took a long time. Thank you, Hyukjin and all!
>
> Bests,
> Dongjoon.
>
> On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi 
> wrote:
>
>> Good to hear and great work Hyukjin! 👏
>>
>> On Wed, 3 Mar 2021, 11:15 Jungtaek Lim, 
>> wrote:
>>
>>> Thanks Hyukjin for driving the huge release, and thanks everyone for
>>> contributing the release!
>>>
>>> On Wed, Mar 3, 2021 at 6:54 PM angers zhu  wrote:
>>>
>>>> Great work, Hyukjin !
>>>>
>>>> Bests,
>>>> Angers
>>>>
>>>> Wenchen Fan  于2021年3月3日周三 下午5:02写道:
>>>>
>>>>> Great work and congrats!
>>>>>
>>>>> On Wed, Mar 3, 2021 at 3:51 PM Kent Yao  wrote:
>>>>>
>>>>>> Congrats, all!
>>>>>>
>>>>>> Bests,
>>>>>> *Kent Yao *
>>>>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>>>>> *a spark enthusiast*
>>>>>> *kyuubi <https://github.com/yaooqinn/kyuubi>is a
>>>>>> unified multi-tenant JDBC interface for large-scale data processing and
>>>>>> analytics, built on top of Apache Spark <http://spark.apache.org/>.*
>>>>>> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A
>>>>>> Spark SQL extension which provides SQL Standard Authorization for 
>>>>>> **Apache
>>>>>> Spark <http://spark.apache.org/>.*
>>>>>> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A
>>>>>> library for reading data from and transferring data to Postgres / 
>>>>>> Greenplum
>>>>>> with Spark SQL and DataFrames, 10~100x faster.*
>>>>>> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A
>>>>>> library that brings excellent and useful functions from various modern
>>>>>> database management systems to Apache Spark <http://spark.apache.org/>.*
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 03/3/2021 15:11,Takeshi Yamamuro
>>>>>>  wrote:
>>>>>>
>>>>>> Great work and Congrats, all!
>>>>>>
>>>>>> Bests,
>>>>>> Takeshi
>>>>>>
>>>>>> On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan 
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> Thanks Hyukjin and congratulations everyone on the release !
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mridul
>>>>>>>
>>>>>>> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>>>>>>>
>>>>>>>> Great work, Hyukjin!
>>>>>>>>
>>>>>>>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We are excited to announce Spark 3.1.1 today.
>>>>>>>>>
>>>>>>>>> Apache Spark 3.1.1 is the second release of the 3.x line. This
>>>>>>>>> release adds
>>>>>>>>> Python type annotations and Python dependency management support
>>>>>>>>> as part of Project Zen.
>>>>>>>>> Other major updates include improved ANSI SQL compliance support,
>>>>>>>>> history server support
>>>>>>>>> in structured streaming, the general availability (GA) of
>>>>>>>>> Kubernetes and node decommissioning
>>>>>>>>> in Kubernetes and Standalone. In addition, this release continues
>>>>>>>>> to focus on usability, stability,
>>>>>>>>> and polish while resolving around 1500 tickets.
>>>>>>>>>
>>>>>>>>> We'd like to thank our contributors and users for their
>>>>>>>>> contributions and early feedback to
>>>>>>>>> this release. This release would not have been possible without
>>>>>>>>> you.
>>>>>>>>>
>>>>>>>>> To download Spark 3.1.1, head over to the download page:
>>>>>>>>> http://spark.apache.org/downloads.html
>>>>>>>>>
>>>>>>>>> To view the release notes:
>>>>>>>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> ---
>>>>>> Takeshi Yamamuro
>>>>>>
>>>>>>


Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
awesome!

2021년 6월 2일 (수) 오전 9:59, Dongjoon Hyun 님이 작성:

> We are happy to announce the availability of Spark 3.1.2!
>
> Spark 3.1.2 is a maintenance release containing stability fixes. This
> release is based on the branch-3.1 maintenance branch of Spark. We strongly
> recommend all 3.1 users to upgrade to this stable release.
>
> To download Spark 3.1.2, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-2.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me Sean.

Yes, there's an optimization on DataFrame.collect which tries to collect the
first few partitions and sees whether the required number of rows has been
found (and repeats if not).

DataFrame.toPandas does not have such an optimization.

I suspect that the shuffle isn't an actual shuffle but just collects the local
limits from executors onto one executor to calculate the global limit.
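
For anyone who wants to poke at this, below is a minimal, hypothetical PySpark
sketch (the toy DataFrame and the limit of 1000 are made up) that compares the
two paths by toggling the Arrow configuration and printing the plans; the
executed plan of each action also shows up in the SQL tab of the Spark UI.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).selectExpr("id", "id % 10 AS bucket")  # toy data

# Path 1: collect() on a limited DataFrame -- the incremental partition
# collection described above applies here.
rows = df.limit(1000).collect()

# Path 2: toPandas() with Arrow enabled -- check the printed plan (and the
# Spark UI) for an extra Exchange before the final collection.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
df.limit(1000).explain()
pdf = df.limit(1000).toPandas()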

On Fri, Nov 12, 2021 at 11:16 PM Sean Owen  wrote:

> Hyukjin can you weigh in?
> Is this exchange due to something in your operations or clearly unique to
> the toPandas operation?
> I didn't think it worked that way, but maybe there is some good reason it
> does.
>
> On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev <
> sergeyivanyc...@gmail.com> wrote:
>
>> Hi Sean,
>>
>> According to the plan I’m observing, this is what happens indeed. There’s
>> exchange operation that sends data to a single partition/task in toPandas()
>> + PyArrow enabled case.
>>
>> 12 нояб. 2021 г., в 16:31, Sean Owen  написал(а):
>>
>> Yes, none of the responses are addressing your question.
>> I do not think it's a bug necessarily; do you end up with one partition
>> in your execution somewhere?
>>
>> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <
>> sergeyivanyc...@gmail.com> wrote:
>>
>>> Of course if I give 64G of ram to each executor they will work. But
>>> what’s the point? Collecting results in the driver should cause a high RAM
>>> usage in the driver and that’s what is happening in collect() case. In the
>>> case where pyarrow serialization is enabled all the data is being collected
>>> on a single executor, which is clearly a wrong way to collect the result on
>>> the driver.
>>>
>>> I guess I’ll open an issue about it in Spark Jira. It clearly looks like
>>> a bug.
>>>
>>> 12 нояб. 2021 г., в 11:59, Mich Talebzadeh 
>>> написал(а):
>>>
>>> OK, your findings do not imply those settings are incorrect. Those
>>> settings will work if you set-up your k8s cluster in peer-to-peer mode with
>>> equal amounts of RAM for each node which is common practice.
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <
>>> sergeyivanyc...@gmail.com> wrote:
>>>
 Yes, in fact those are the settings that cause this behaviour. If set
 to false, everything goes fine since the implementation in spark sources in
 this case is

 pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)

 Best regards,


 Sergey Ivanychev

 11 нояб. 2021 г., в 13:58, Mich Talebzadeh 
 написал(а):

 
 Have you tried the following settings:

 spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

 spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")

 HTH

view my Linkedin profile
 


 *Disclaimer:* Use it at your own risk. Any and all responsibility for
 any loss, damage or destruction of data or any other property which may
 arise from relying on this email's technical content is explicitly
 disclaimed. The author will in no case be liable for any monetary damages
 arising from such loss, damage or destruction.




 On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh 
 wrote:

> OK, so it boils down to how Spark creates the toPandas() DF under the
> bonnet, and how many executors are involved in the k8s cluster. In this model
> Spark will create executors = no. of nodes - 1
>
> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <
> sergeyivanyc...@gmail.com> wrote:
>
>> > Just to confirm with Collect() alone, this is all on the driver?
>>
>> I shared the screenshot with the plan in the first email. In the
>> collect() case the data gets fetched to the driver without problems.
>>
>> Best regards,
>>
>>
>> Sergey Ivanychev
>>
>> 4 нояб. 2021 г., в 20:37, Mich Talebzadeh 
>> написал(а):
>>
>> Just to confirm with Collect() alone, this is all on the driver?
>>
>> --
>
>
>view my Linkedin profile
> 
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for
> any loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary damage

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
To add some more: if the number of rows to collect is large,
DataFrame.collect can be slower because it launches multiple Spark jobs
sequentially.

Given that DataFrame.toPandas does not take the number of rows to collect, it's
debatable whether to apply the same optimization as DataFrame.collect here.

We could have a configuration to enable and disable it, but the implementation
of this in DataFrame.toPandas would be complicated due to existing
optimizations such as Arrow. I haven't taken a deeper look, but my gut says
it's not worthwhile.

On Sat, Nov 13, 2021 at 12:05 PM Hyukjin Kwon  wrote:

> Thanks for pinging me Sean.
>
> Yes, there's an optimization on DataFrame.collect which tries to collect the
> first few partitions and sees whether the required number of rows has been
> found (and repeats if not).
>
> DataFrame.toPandas does not have such an optimization.
>
> I suspect that the shuffle isn't an actual shuffle but just collects the local
> limits from executors onto one executor to calculate the global limit.
>
> On Fri, Nov 12, 2021 at 11:16 PM Sean Owen  wrote:
>
>> Hyukjin can you weigh in?
>> Is this exchange due to something in your operations or clearly unique to
>> the toPandas operation?
>> I didn't think it worked that way, but maybe there is some good reason it
>> does.
>>
>> On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev <
>> sergeyivanyc...@gmail.com> wrote:
>>
>>> Hi Sean,
>>>
>>> According to the plan I’m observing, this is what happens indeed.
>>> There’s exchange operation that sends data to a single partition/task in
>>> toPandas() + PyArrow enabled case.
>>>
>>> 12 нояб. 2021 г., в 16:31, Sean Owen  написал(а):
>>>
>>> Yes, none of the responses are addressing your question.
>>> I do not think it's a bug necessarily; do you end up with one partition
>>> in your execution somewhere?
>>>
>>> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <
>>> sergeyivanyc...@gmail.com> wrote:
>>>
>>>> Of course if I give 64G of ram to each executor they will work. But
>>>> what’s the point? Collecting results in the driver should cause a high RAM
>>>> usage in the driver and that’s what is happening in collect() case. In the
>>>> case where pyarrow serialization is enabled all the data is being collected
>>>> on a single executor, which is clearly a wrong way to collect the result on
>>>> the driver.
>>>>
>>>> I guess I’ll open an issue about it in Spark Jira. It clearly looks
>>>> like a bug.
>>>>
>>>> 12 нояб. 2021 г., в 11:59, Mich Talebzadeh 
>>>> написал(а):
>>>>
>>>> OK, your findings do not imply those settings are incorrect. Those
>>>> settings will work if you set-up your k8s cluster in peer-to-peer mode with
>>>> equal amounts of RAM for each node which is common practice.
>>>>
>>>> HTH
>>>>
>>>>
>>>>view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <
>>>> sergeyivanyc...@gmail.com> wrote:
>>>>
>>>>> Yes, in fact those are the settings that cause this behaviour. If set
>>>>> to false, everything goes fine since the implementation in spark sources 
>>>>> in
>>>>> this case is
>>>>>
>>>>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>>>>>
>>>>> Best regards,
>>>>>
>>>>>
>>>>> Sergey Ivanychev
>>>>>
>>>>> 11 нояб. 2021 г., в 13:58, Mich Talebzadeh 
>>>>> написал(а):
>>>>>
>>>>> 
>>>>> Have you tried the following settings:
>>>>>
>>>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>>>>
>>>>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")

Re: [R] SparkR on conda-forge

2021-12-19 Thread Hyukjin Kwon
Awesome!

On Mon, 20 Dec 2021 at 09:43, yonghua  wrote:

> Nice release. thanks for sharing.
>
> On 2021/12/20 3:55, Maciej wrote:
> > FYI ‒ thanks to good folks from conda-forge we have now these:
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There
are integration test cases for K8S support, and I myself also tested it
before.
It would be helpful if you try what I did at
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
and see if it works.
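
For a quick check, this is roughly the pattern from that post, assuming the
environment was packed with conda-pack into pyspark_conda_env.tar.gz (the
archive name and the "#environment" alias below are just examples):

import os
from pyspark.sql import SparkSession

# Point the Python workers at the environment unpacked from spark.archives.
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)

spark.range(10).show()  # quick check that executors come up with the shipped env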

On Mon, 6 Dec 2021 at 17:22, Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:

> Hi Mich,
>
>
>
> Thanks for your response. Yes, the --py-files option works. I also tested it.
>
> The question is why the --archives option doesn't?
>
>
>
> From Jira I can see that it should be available since 3.1.0:
>
>
>
> https://issues.apache.org/jira/browse/SPARK-33530
>
> https://issues.apache.org/jira/browse/SPARK-33615
>
>
>
> Best,
>
> Meikel
>
>
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Samstag, 4. Dezember 2021 18:36
> *To:* Bode, Meikel, NMA-CFD 
> *Cc:* dev ; user@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
>
>
>
> Hi Meikel
>
>
>
> In the past I tried with
>
>
>
>--py-files
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
>
>--archives
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
>
>
>
> which is basically what you are doing. The first line (--py-files) works but
> the second one fails.
>
>
>
> It tries to unpack them:
>
>
>
> Unpacking an archive hdfs://
> 50.140.197.220:9000/minikube/codes/pyspark_venv.zip#pyspark_venv
> 
>  from
> /tmp/spark-502a5b57-0fe6-45bd-867d-9738e678e9a3/pyspark_venv.zip to
> /opt/spark/work-dir/./pyspark_venv
>
>
>
> But it failed.
>
>
>
> This could be due to creating the virtual environment inside the docker image
> in the work-dir, or sometimes when there is not enough available memory to
> gunzip and untar the file, especially if your executors are built on
> cluster nodes with less memory than the driver node.
>
>
>
> However, the most convenient way to add additional packages to the docker
> image is to add them directly to the docker image at the time of creating the
> image. So external packages are bundled as a part of my docker image
> because the image is fixed, and if an application requires that set of
> dependencies every time, they are there. Also note that every RUN statement
> creates an intermediate container and hence increases the build time, so
> reduce them as follows:
>
> RUN pip install pyyaml numpy cx_Oracle --no-cache-dir
>
> The --no-cache-dir option to pip prevents the downloaded binaries from
> being added to the image, reducing the image size. It is also advisable to
> install all packages in one line, for the same reason: every RUN statement
> creates an intermediate container and increases the build time.
>
> Log in to the docker image and check for Python packages installed
>
> docker run -u 0 -it 
> spark/spark-py:3.1.1-scala_2.12-8-jre-slim-buster_java8PlusPackages bash
>
> root@5bc049af7278:/opt/spark/work-dir# pip list
>
> PackageVersion
>
> -- ---
>
> cx-Oracle  8.3.0
>
> numpy  1.21.4
>
> pip21.3.1
>
> PyYAML 6.0
>
> setuptools 59.4.0
>
> wheel  0.34.2
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Sat, 4 Dec 2021 at 07:52, Bode, Meikel, NMA-CFD <
> meikel.b...@bertelsmann.de> wrote:
>
> Hi Mich,
>
>
>
> Sure, that's possible. But distributing the complete env would be more
> practical.
>
> A workaround at the moment is that we build different environments, store
> them in a PV, mount it into the pods, and refer to the desired env from the
> SparkApplication resource.
>
>
>
> But actually these options exist and I want to understand what the issue
> is…
>
> Any hints on that?
>
>
>
> Best,
>
> Meikel
>
>
>
> *

Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo

On Tue, 14 Jun 2022 at 15:04, Xiao Li  wrote:

> Hi, all,
>
> The ASF has an official store at RedBubble
>  that Apache Community
> Development (ComDev) runs. If you are interested in buying Spark Swag, 70
> products featuring the Spark logo are available:
> https://www.redbubble.com/shop/ap/113203780
>
> Go Spark!
>
> Xiao
>


Re: [Feature Request] make unix_micros() and unix_millis() available in PySpark (pyspark.sql.functions)

2022-10-16 Thread Hyukjin Kwon
You can work around it by leveraging expr, e.g., expr("unix_micros(col)"),
for now.
We should have the Scala binding first before we add the Python one, FWIW.
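
A minimal PySpark sketch of that workaround (the column name "ts" is just an
example):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.sql("SELECT current_timestamp() AS ts")

# Until dedicated bindings exist, call the SQL built-ins through expr().
df.select(
    F.expr("unix_micros(ts)").alias("micros"),
    F.expr("unix_millis(ts)").alias("millis"),
).show()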

On Sat, 15 Oct 2022 at 06:19, Martin  wrote:

> Hi everyone,
>
> In *Spark SQL* there are several timestamp related functions
>
>- unix_micros(timestamp)
>Returns the number of microseconds since 1970-01-01 00:00:00 UTC.
>- unix_millis(timestamp)
>Returns the number of milliseconds since 1970-01-01 00:00:00 UTC.
>Truncates higher levels of precision.
>
> See https://spark.apache.org/docs/latest/sql-ref-functions-builtin.html
>
> Currently these are *"missing" in pyspark.sql.functions*.
>
> https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html#datetime-functions
>
> I'd appreciate it if these were also available in PySpark.
>
> Cheers,
> Martin
>


Re: [ANNOUNCE] Apache Spark 3.3.1 released

2022-10-26 Thread Hyukjin Kwon
Thanks, Yuming.

On Wed, 26 Oct 2022 at 16:01, L. C. Hsieh  wrote:

> Thank you for driving the release of Apache Spark 3.3.1, Yuming!
>
> On Tue, Oct 25, 2022 at 11:38 PM Dongjoon Hyun 
> wrote:
> >
> > It's great. Thank you so much, Yuming!
> >
> > Dongjoon
> >
> > On Tue, Oct 25, 2022 at 11:23 PM Yuming Wang  wrote:
> >>
> >> We are happy to announce the availability of Apache Spark 3.3.1!
> >>
> >> Spark 3.3.1 is a maintenance release containing stability fixes. This
> >> release is based on the branch-3.3 maintenance branch of Spark. We
> strongly
> >> recommend all 3.3 users to upgrade to this stable release.
> >>
> >> To download Spark 3.3.1, head over to the download page:
> >> https://spark.apache.org/downloads.html
> >>
> >> To view the release notes:
> >> https://spark.apache.org/releases/spark-release-3-3-1.html
> >>
> >> We would like to acknowledge all community members for contributing to
> this
> >> release. This release would not have been possible without you.
> >>
> >>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Slack for PySpark users

2023-03-27 Thread Hyukjin Kwon
Yeah, actually I think we should have a Slack channel so we can
easily discuss things with users and developers.

On Tue, 28 Mar 2023 at 03:08, keen  wrote:

> Hi all,
> I really like *Slack *as communication channel for a tech community.
> There is a Slack workspace for *delta lake users* (
> https://go.delta.io/slack) that I enjoy a lot.
> I was wondering if there is something similar for PySpark users.
>
> If not, would there be anything wrong with creating a new Slack workspace
> for PySpark users? (when explicitly mentioning that this is *not*
> officially part of Apache Spark)?
>
> Cheers
> Martin
>


Re: [ANNOUNCE] Apache Spark 3.4.1 released

2023-06-23 Thread Hyukjin Kwon
Thanks!

On Sat, Jun 24, 2023 at 11:01 AM Mridul Muralidharan 
wrote:

>
> Thanks Dongjoon !
>
> Regards,
> Mridul
>
> On Fri, Jun 23, 2023 at 6:58 PM Dongjoon Hyun  wrote:
>
>> We are happy to announce the availability of Apache Spark 3.4.1!
>>
>> Spark 3.4.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.4 maintenance branch of Spark. We
>> strongly
>> recommend all 3.4 users to upgrade to this stable release.
>>
>> To download Spark 3.4.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-4-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>>
>> Dongjoon Hyun
>>
>


Re: Introducing English SDK for Apache Spark - Seeking Your Feedback and Contributions

2023-07-03 Thread Hyukjin Kwon
The demo was really amazing.

On Tue, 4 Jul 2023 at 09:17, Farshid Ashouri 
wrote:

> This is wonderful news!
>
> On Tue, 4 Jul 2023 at 01:14, Gengliang Wang  wrote:
>
>> Dear Apache Spark community,
>>
>> We are delighted to announce the launch of a groundbreaking tool that
>> aims to make Apache Spark more user-friendly and accessible - the
>> English SDK . Powered by
>> the application of Generative AI, the English SDK
>>  allows you to execute
>> complex tasks with simple English instructions. This exciting news was 
>> announced
>> recently at the Data+AI Summit
>>  and also introduced
>> through a detailed blog post
>> 
>> .
>>
>> Now, we need your invaluable feedback and contributions. The aim of the
>> English SDK is not only to simplify and enrich your Apache Spark experience
>> but also to grow with the community. We're calling upon Spark developers
>> and users to explore this innovative tool, offer your insights, provide
>> feedback, and contribute to its evolution.
>>
>> You can find more details about the SDK and usage examples on the GitHub
>> repository https://github.com/databrickslabs/pyspark-ai/. If you have
>> any feedback or suggestions, please feel free to open an issue directly on
>> the repository. We are actively monitoring the issues and value your
>> insights.
>>
>> We also welcome pull requests and are eager to see how you might extend
>> or refine this tool. Let's come together to continue making Apache Spark
>> more approachable and user-friendly.
>>
>> Thank you in advance for your attention and involvement. We look forward
>> to hearing your thoughts and seeing your contributions!
>>
>> Best,
>> Gengliang Wang
>>
> --
>
>
> *Farshid Ashouri*,
> Senior Vice President,
> J.P. Morgan & Chase Co.
> +44 7932 650 788
>
>


Re: [FYI] SPARK-45981: Improve Python language test coverage

2023-12-02 Thread Hyukjin Kwon
Awesome!

On Sat, Dec 2, 2023 at 2:33 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> As a part of Apache Spark 4.0.0 (SPARK-44111), the Apache Spark community
> starts to have test coverage for all supported Python versions from Today.
>
> - https://github.com/apache/spark/actions/runs/7061665420
>
> Here is a summary.
>
> 1. Main CI: All PRs and commits on `master` branch are tested with Python
> 3.9.
> 2. Daily CI:
> https://github.com/apache/spark/actions/workflows/build_python.yml
> - PyPy 3.8
> - Python 3.10
> - Python 3.11
> - Python 3.12
>
> This is a great addition for PySpark 4.0+ users and an extensible
> framework for all future Python versions.
>
> Thank you all for making this together!
>
> Best,
> Dongjoon.
>


Re: Architecture of Spark Connect

2023-12-14 Thread Hyukjin Kwon
By default for now, yes. One Spark Connect server handles multiple Spark
sessions. To multiplex or run multiple drivers, you need some extra work such
as a gateway.
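
As a small illustration, each client attaches its own session to the one
running Spark Connect server (the URL below assumes the default local setup on
port 15002):

from pyspark.sql import SparkSession

# Every client gets its own Spark Connect session, but they all talk to the
# same long-running server/driver started by the connect server start script.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
spark.range(5).show()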

On Thu, 14 Dec 2023 at 12:03, Kezhi Xiong  wrote:

> Hi,
>
> My understanding is there is only one driver/spark context for all user
> sessions. When you run the bin/start-connect-server script, you are
> submitting one long standing spark job / application. Every time a new user
> request comes in, a new user session is created under that. Please correct
> me if I am wrong.
>
> Kezhi
>
> On Thu, Dec 14, 2023 at 10:35 AM Nikhil Goyal  wrote:
>
>> [ External sender. Exercise caution. ]
>>
>> If multiple applications are running, we would need multiple spark
>> connect servers? If so, is the user responsible for creating these servers
>> or they are just created on the fly when the user requests a new spark
>> session?
>>
>> On Thu, Dec 14, 2023 at 10:28 AM Nikhil Goyal 
>> wrote:
>>
>>> Hi folks,
>>> I am trying to understand one question. Does Spark Connect create a new
>>> driver in the backend for every user or there are a fixed number of drivers
>>> running to which requests are sent to?
>>>
>>> Thanks
>>> Nikhil
>>>
>>


Re: Pyspark UDF as a data source for streaming

2023-12-28 Thread Hyukjin Kwon
Just FYI, a streaming Python data source is in progress at
https://github.com/apache/spark/pull/44416; we will likely release this in
Spark 4.0.

On Thu, Dec 28, 2023 at 4:53 PM Поротиков Станислав Вячеславович
 wrote:

> Yes, it's actual data.
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Wednesday, December 27, 2023 9:43 PM
> *Cc:* user@spark.apache.org
> *Subject:* Re: Pyspark UDF as a data source for streaming
>
>
>
> Is this generated data actual data, or are you testing the application?
>
>
>
> Sounds like a form of Lambda architecture here with some
> decision/processing not far from the attached diagram
>
>
>
> HTH
>
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>  view my Linkedin profile
> 
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 27 Dec 2023 at 13:26, Поротиков Станислав Вячеславович <
> s.poroti...@skbkontur.ru> wrote:
>
> Actually it's JSON with a specific structure from an API server.
>
> But the task is to constantly check whether new data appears on the API
> server and load it into Kafka.
>
> Full pipeline can be presented like that:
>
> REST API -> Kafka -> some processing -> Kafka/Mongo -> …
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Wednesday, December 27, 2023 6:17 PM
> *To:* Поротиков Станислав Вячеславович 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Pyspark UDF as a data source for streaming
>
>
>
> Ok so you want to generate some random data and load it into Kafka on a
> regular interval and the rest?
>
>
>
> HTH
>
> Mich Talebzadeh,
>
> Dad | Technologist | Solutions Architect | Engineer
>
> London
>
> United Kingdom
>
>
>
>  view my Linkedin profile
> 
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Wed, 27 Dec 2023 at 12:16, Поротиков Станислав Вячеславович
>  wrote:
>
> Hello!
>
> Is it possible to write a PySpark UDF that generates data for a streaming
> DataFrame?
>
> I want to get some data from REST API requests in real time and save this
> data to a DataFrame.
>
> And then put it into Kafka.
>
> I can't figure out how to create a streaming DataFrame from generated data.
>
>
>
> I am new to Spark streaming.
>
> Could you give me some hints?
>
>
>
> Best regards,
>
> Stanislav Porotikov
>
>
>
>


Re: [ANNOUNCE] Apache Spark 3.5.1 released

2024-03-04 Thread Hyukjin Kwon
Is this related to https://github.com/apache/spark/pull/42428?

cc @Yang,Jie(INF) 

On Mon, 4 Mar 2024 at 22:21, Jungtaek Lim 
wrote:

> Shall we revisit this functionality? The API doc is built with individual
> versions, and for each individual version we depend on other released
> versions. This does not seem to be right to me. Also, the functionality is
> only in PySpark API doc which does not seem to be consistent as well.
>
> I don't think this is manageable with the current approach (listing
> versions in version-dependent doc). Let's say we release 3.4.3 after 3.5.1.
> Should we update the versions in 3.5.1 to add 3.4.3 in version switcher?
> How about the time we are going to release the new version after releasing
> 10 versions? What's the criteria of pruning the version?
>
> Unless we have a good answer to these questions, I think it's better to
> revert the functionality - it missed various considerations.
>
> On Fri, Mar 1, 2024 at 2:44 PM Jungtaek Lim 
> wrote:
>
>> Thanks for reporting - this is odd - the dropdown did not exist in other
>> recent releases.
>>
>> https://spark.apache.org/docs/3.5.0/api/python/index.html
>> https://spark.apache.org/docs/3.4.2/api/python/index.html
>> https://spark.apache.org/docs/3.3.4/api/python/index.html
>>
>> Looks like the dropdown feature was recently introduced but only partially
>> done. The dropdown itself was added, but how to bump the
>> version was never documented.
>> The contributor proposed a way to update the version "automatically",
>> but the PR wasn't merged. As a result, we have neither the
>> instructions for bumping the version manually nor an automatic bump.
>>
>> * PR for addition of dropdown: https://github.com/apache/spark/pull/42428
>> * PR for automatically bumping version:
>> https://github.com/apache/spark/pull/42881
>>
>> We will probably need to add an instruction in the release process to
>> update the version. (For automatic bumping I don't have a good idea.)
>> I'll look into it. Please expect some delay during the holiday weekend
>> in S. Korea.
>>
>> Thanks again.
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Fri, Mar 1, 2024 at 2:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> BTW, Jungtaek.
>>>
>>> PySpark document seems to show a wrong branch. At this time, `master`.
>>>
>>> https://spark.apache.org/docs/3.5.1/api/python/index.html
>>>
>>> PySpark Overview
>>> 
>>>
>>>Date: Feb 24, 2024 Version: master
>>>
>>> [image: Screenshot 2024-02-29 at 21.12.24.png]
>>>
>>>
>>> Could you do the follow-up, please?
>>>
>>> Thank you in advance.
>>>
>>> Dongjoon.
>>>
>>>
>>> On Thu, Feb 29, 2024 at 2:48 PM John Zhuge  wrote:
>>>
 Excellent work, congratulations!

 On Wed, Feb 28, 2024 at 10:12 PM Dongjoon Hyun 
 wrote:

> Congratulations!
>
> Bests,
> Dongjoon.
>
> On Wed, Feb 28, 2024 at 11:43 AM beliefer  wrote:
>
>> Congratulations!
>>
>>
>>
>> At 2024-02-28 17:43:25, "Jungtaek Lim" 
>> wrote:
>>
>> Hi everyone,
>>
>> We are happy to announce the availability of Spark 3.5.1!
>>
>> Spark 3.5.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.5 maintenance branch of Spark. We
>> strongly
>> recommend all 3.5 users to upgrade to this stable release.
>>
>> To download Spark 3.5.1, head over to the download page:
>> https://spark.apache.org/downloads.html
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-5-1.html
>>
>> We would like to acknowledge all community members for contributing
>> to this
>> release. This release would not have been possible without you.
>>
>> Jungtaek Lim
>>
>> ps. Yikun is helping us through releasing the official docker image
>> for Spark 3.5.1 (Thanks Yikun!) It may take some time to be generally
>> available.
>>
>>

 --
 John Zhuge

>>>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is the SparkR releases in the Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community, unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> d...@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark Mlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a slack
>>
>> community that managed to create more heat than light. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Documentation for "hidden" RESTful API for submitting jobs (not history server)

2016-03-14 Thread Hyukjin Kwon
Hi all,


While googling Spark, I accidentally found a RESTful API in Spark
for submitting jobs.

The link is here, http://arturmkrtchyan.com/apache-spark-hidden-rest-api

As Josh said, I can see the history of this RESTful API,
https://issues.apache.org/jira/browse/SPARK-5388 and also good
documentation here, as a PDF,
https://issues.apache.org/jira/secure/attachment/12696651/stable-spark-submit-in-standalone-mode-2-4-15.pdf
.

My question is, I cannot find anything about this on the Spark website except
for the history above.

I tried to search JIRA but also could not find anything related to the
documentation for this.

Wouldn't it be great if users knew how to use this?

Has this already been considered, or are there some documents for this on the
website?

Please give me some feedback.


Thanks!


Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Hyukjin Kwon
Hi,

For the RESTful API for submitting an application, please take a look at this
link:

http://arturmkrtchyan.com/apache-spark-hidden-rest-api
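
For a rough idea of what a submission against that endpoint looks like, here is
a hedged sketch based on the blog post above (the host, port, and all field
values are examples taken from that post, not from official Spark
documentation):

import json
import urllib.request

# Standalone-master REST submission endpoint described in the blog post above.
url = "http://spark-master:6066/v1/submissions/create"
payload = {
    "action": "CreateSubmissionRequest",
    "appResource": "hdfs:///apps/my-app.jar",    # example path
    "mainClass": "com.example.MyApp",            # example class
    "appArgs": ["arg1"],
    "clientSparkVersion": "1.6.0",
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "MyApp",
        "spark.master": "spark://spark-master:7077",
        "spark.submit.deployMode": "cluster",
        "spark.jars": "hdfs:///apps/my-app.jar",
    },
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json;charset=UTF-8"},
)
print(urllib.request.urlopen(req).read().decode("utf-8"))
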
On 26 Mar 2016 12:07 p.m., "vetal king"  wrote:

> Prateek
>
> It's possible to submit a Spark application from an outside application. If
> you are using Java, then use ProcessBuilder and execute spark-submit.
>
> There are two other options which I have not used. There is some Spark
> submit server, and Spark also provides a REST API to submit jobs, but I don't
> have much information on them.
> On 25 Mar 2016 11:09 pm, "prateek arora" 
> wrote:
>
>> Hi
>>
>> I want to submit a Spark application from outside of the Spark cluster, so
>> please help me by providing information regarding this.
>>
>> Regards
>> Prateek
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/is-there-any-way-to-submit-spark-application-from-outside-of-spark-cluster-tp26599.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: Databricks fails to read the csv file with blank line at the file header

2016-03-28 Thread Hyukjin Kwon
Could I ask which version you are using?

It looks like the cause is the empty line right after the header (because that
case is not checked in tests).

However, empty lines before the header or inside the data are tested:

https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/ages.csv

https://raw.githubusercontent.com/databricks/spark-csv/master/src/test/resources/cars.csv

So I think it should be able to read that case as well, and this
might be an issue.

It can be simply handled by ETL, but I think the behaviour should be
consistent.

Maybe it would be better if this were opened as an issue and discussed?
On 29 Mar 2016 6:54 a.m., "Ashok Kumar" 
wrote:

> Thanks a ton sir. Very helpful
>
>
> On Monday, 28 March 2016, 22:36, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Pretty straight forward
>
> #!/bin/ksh
> DIR="hdfs://:9000/data/stg/accounts/nw/x"
> #
> ## Remove the blank header line from the spreadsheets and compress them
> #
> echo `date` " ""===  Started Removing blank header line and
> Compressing all csv FILEs"
> for FILE in `ls *.csv`
> do
>   sed '1d' ${FILE} > ${FILE}.tmp
>   mv -f ${FILE}.tmp ${FILE}
>   /usr/bin/bzip2 ${FILE}
> done
> #
> ## Clear out hdfs staging directory
> #
> echo `date` " ""===  Started deleting old files from hdfs staging
> directory ${DIR}"
> hdfs dfs -rm -r ${DIR}/*.bz2
> echo `date` " ""===  Started Putting bz2 fileS to hdfs staging
> directory ${DIR}"
> for FILE in `ls *.bz2`
> do
>   hdfs dfs -copyFromLocal ${FILE} ${DIR}
> done
> echo `date` " ""===  Checking that all files are moved to hdfs staging
> directory"
> hdfs dfs -ls ${DIR}
> exit 0
> HTH
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 28 March 2016 at 22:24, Ashok Kumar  wrote:
>
> Hello Mich
>
> If you can accommodate it, can you please share your approach to steps 1-3 above.
>
> Best regards
>
>
> On Sunday, 27 March 2016, 14:53, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>
> Pretty simple as usual it is a combination of ETL and ELT.
>
> Basically csv files are loaded into staging directory on host, compressed
> before pushing into hdfs
>
>
>1. ETL --> Get rid of the header blank line on the csv files
>2. ETL --> Compress the csv files
>3. ETL --> Put the compressed CVF files  into hdfs staging directory
>4. ELT --> Use databricks to load the csv files
>5. ELT --> Spark FP to prcess the csv data
>6. ELT --> register it as a temporary table
>7. ELT --> Create an ORC table in a named database in compressed zlib2
>format in Hive database
>8. ELT --> Insert/select from temporary table to Hive table
>
>
> So the data is stored in an ORC table and one can do whatever analysis
> using Spark, Hive etc
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
> http://talebzadehmich.wordpress.com
>
>
> On 27 March 2016 at 03:05, Koert Kuipers  wrote:
>
> To me this is expected behavior that I would not want fixed, but if you
> look at the recent commits for spark-csv it has one that deals with this...
> On Mar 26, 2016 21:25, "Mich Talebzadeh" 
> wrote:
>
>
> Hi,
>
> I have a standard csv file (saved as csv in HDFS) that has a blank first
> line before the header,
> as follows:
>
> [blank line]
> Date, Type, Description, Value, Balance, Account Name, Account Number
> [blank line]
> 22/03/2011,SBT,"'FUNDS TRANSFER , FROM A/C 1790999",200.00,200.00,"'BROWN
> AE","'638585-60125663",
>
> When I read this file using the following standard
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> it crashes.
>
> java.util.NoSuchElementException
> at java.util.ArrayList$Itr.next(ArrayList.java:794)
>
>  If I go and manually delete the first blank line it works OK
>
> val df =
> sqlContext.read.format("com.databricks.spark.csv").option("inferSchema",
> "true").option("header",
> "true").load("hdfs://rhes564:9000/data/stg/accounts/ac/")
>
> df: org.apache.spark.sql.DataFrame = [Date: string,  Type: string,
> Description: string,  Value: double,  Balance: double,  Account Name:
> string,  Account Number: string]
>
> I can easily write a shell script to get rid of the blank line. I was
> wondering if databricks spark-csv has a flag to get rid of the first blank
> line in the csv file format?
>
> P.S. If the file is stored as DOS text file, this problem goes away.
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 

Re: Null pointer exception when using com.databricks.spark.csv

2016-03-29 Thread Hyukjin Kwon
Hi,

I guess this is not a CSV-datasource-specific problem.

Does loading any file (e.g. with textFile()) work?

I think this is related to this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Error-while-running-example-scala-application-using-spark-submit-td10056.html
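
For example, a quick sanity check along those lines from the PySpark shell (the
file path is just an example):

# If even a plain text read fails with a similar error, the problem is likely
# in the local environment (e.g. the Hadoop binaries on Windows) rather than
# in the spark-csv package itself.
rdd = sc.textFile("C:/tmp/cars.csv")  # any small local file will do
print(rdd.take(3))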


2016-03-30 12:44 GMT+09:00 Selvam Raman :

> Hi,
>
> I am using the Spark 1.6.0 prebuilt-for-Hadoop-2.6.0 version on my Windows machine.
>
> I was trying to use the databricks csv format to read a csv file. I used the
> below command.
>
> [image: Inline image 1]
>
> I got null pointer exception. Any help would be greatly appreciated.
>
> [image: Inline image 2]
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Spark/Parquet

2016-04-14 Thread Hyukjin Kwon
Currently Spark uses Parquet 1.7.0 (parquet-mr).

If you meant writer version 2 (parquet-format), you can specify this by
manually setting it as below:

sparkContext.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
ParquetProperties.WriterVersion.PARQUET_2_0.toString)


2016-04-15 2:21 GMT+09:00 Younes Naguib :

> Hi all,
>
>
>
> When parquet 2.0 planned in Spark?
>
> Or is it already?
>
>
>
>
>
> *Younes Naguib*
>
> Triton Digital | 1440 Ste-Catherine W., Suite 1200 | Montreal, QC  H3G 1R8
>
> Tel.: +1 514 448 4037 x2688 | Tel.: +1 866 448 4037 x2688 | younes.naguib
> @tritondigital.com 
>
>
>


Re: Spark sql not pushing down timestamp range queries

2016-04-14 Thread Hyukjin Kwon
Hi,


String comparison itself is pushed down fine, but the problem is dealing
with the cast.


It was pushed down before but it was reverted (
https://github.com/apache/spark/pull/8049).

Several fixes were attempted, e.g. https://github.com/apache/spark/pull/11005
and others, but none of them made it in.


To cut it short, the filter is not being pushed down because it is unsafe to
resolve the cast (e.g. long to integer).
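
From the user side this can be observed in the physical plan; here is a small
PySpark sketch using today's API (the Parquet path and the column name are made
up, and Parquet just serves as a pushdown-capable source):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny Parquet table with a timestamp column.
spark.sql("SELECT current_timestamp() AS registration") \
    .write.mode("overwrite").parquet("/tmp/events")

df = spark.read.parquet("/tmp/events")

# explain() shows whether the predicate reached the source ("PushedFilters:
# [...]") or stayed in Spark as a Filter over a cast, as in the plans linked
# in the message below.
df.filter(F.col("registration") >= "2015-05-28").explain()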

As a workaround, the implementation of the Solr data source would have to be
changed to one based on CatalystScan, which takes all the filters.

However, CatalystScan is not designed to be binary compatible across releases,
although it looks like some consider it stable now, as mentioned here:
https://github.com/apache/spark/pull/10750#issuecomment-175400704.


Thanks!


2016-04-15 3:30 GMT+09:00 Mich Talebzadeh :

> Hi Josh,
>
> Can you please clarify whether date comparisons as two strings work at all?
>
> I was under the impression that with string comparison only the first
> characters are compared?
>
> Thanks
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 14 April 2016 at 19:26, Josh Rosen  wrote:
>
>> AFAIK this is not being pushed down because it involves an implicit cast
>> and we currently don't push casts into data sources or scans; see
>> https://github.com/databricks/spark-redshift/issues/155 for a
>> possibly-related discussion.
>>
>> On Thu, Apr 14, 2016 at 10:27 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Are you comparing strings in here or timestamp?
>>>
>>> Filter ((cast(registration#37 as string) >= 2015-05-28) &&
>>> (cast(registration#37 as string) <= 2015-05-29))
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> *
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 14 April 2016 at 18:04, Kiran Chitturi >> > wrote:
>>>
 Hi,

 Timestamp range filter queries in SQL are not getting pushed down to
 the PrunedFilteredScan instances. The filtering is happening at the Spark
 layer.

 The physical plan for timestamp range queries is not showing the pushed
 filters where as range queries on other types is working fine as the
 physical plan is showing the pushed filters.

 Please see below for code and examples.

 *Example:*

 *1.* Range filter queries on Timestamp types

*code: *

> sqlContext.sql("SELECT * from events WHERE `registration` >=
> '2015-05-28' AND `registration` <= '2015-05-29' ")

*Full example*:
 https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
 *plan*:
 https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-time-range-sql

 *2. * Range filter queries on Long types

 *code*:

> sqlContext.sql("SELECT * from events WHERE `length` >= '700' and
> `length` <= '1000'")

 *Full example*:
 https://github.com/lucidworks/spark-solr/blob/master/src/test/scala/com/lucidworks/spark/EventsimTestSuite.scala#L151
 *plan*:
 https://gist.github.com/kiranchitturi/4a52688c9f0abe3d4b2bd8b938044421#file-length-range-sql

 The SolrRelation class we use extends
 
 the PrunedFilteredScan.

 Since Solr supports date ranges, I would like for the timestamp filters
 to be pushed down to the Solr query.

 Are there limitations on the type of filters that are passed down with
 Timestamp types ?
 Is there something that I should do in my code to fix this ?

 Thanks,
 --
 Kiran Chitturi


>>>
>


Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
I hope it was not too late :).

It is possible.

Please check csvRdd api here,
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150
.

Thanks!
On 2 Apr 2016 2:47 a.m., "Benjamin Kim"  wrote:

> Does anyone know if this is possible? I have an RDD loaded with rows of
> CSV data strings. Each string representing the header row and multiple rows
> of data along with delimiters. I would like to feed each thru a CSV parser
> to convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase
> table with this data.
>
> Please let me know if you have any ideas.
>
> Thanks,
> Ben
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: In-Memory Only Spark Shuffle

2016-04-15 Thread Hyukjin Kwon
This reminds me of this Jira,
https://issues.apache.org/jira/browse/SPARK-3376 and this PR,
https://github.com/apache/spark/pull/5403.

AFAIK, it is not and won't be supported.
On 2 Apr 2016 4:13 a.m., "slavitch"  wrote:

> Hello;
>
> I’m working on spark with very large memory systems (2TB+) and notice that
> Spark spills to disk in shuffle.  Is there a way to force spark to stay
> exclusively in memory when doing shuffle operations?   The goal is to keep
> the shuffle data either in the heap or in off-heap memory (in 1.6.x) and
> never touch the IO subsystem.  I am willing to have the job fail if it runs
> out of RAM.
>
> spark.shuffle.spill true  is deprecated in 1.6 and does not work in
> Tungsten
> sort in 1.5.x
>
> "WARN UnsafeShuffleManager: spark.shuffle.spill was set to false, but this
> is ignored by the tungsten-sort shuffle manager; its optimized shuffles
> will
> continue to spill to disk when necessary.”
>
> If this is impossible via configuration changes what code changes would be
> needed to accomplish this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/In-Memory-Only-Spark-Shuffle-tp26661.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: can spark-csv package accept strings instead of files?

2016-04-15 Thread Hyukjin Kwon
Hi,

Would you try the code below?

val csvRDD = ... // your processing that builds the CSV string RDD
val df = new CsvParser().csvRdd(sqlContext, csvRDD, useHeader = true)

Thanks!
On 16 Apr 2016 1:35 a.m., "Benjamin Kim"  wrote:

> Hi Hyukjin,
>
> I saw that. I don’t know how to use it. I’m still learning Scala on my
> own. Can you help me to start?
>
> Thanks,
> Ben
>
> On Apr 15, 2016, at 8:02 AM, Hyukjin Kwon  wrote:
>
> I hope it was not too late :).
>
> It is possible.
>
> Please check csvRdd api here,
> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L150
> .
>
> Thanks!
> On 2 Apr 2016 2:47 a.m., "Benjamin Kim"  wrote:
>
>> Does anyone know if this is possible? I have an RDD loaded with rows of
>> CSV data strings. Each string representing the header row and multiple rows
>> of data along with delimiters. I would like to feed each thru a CSV parser
>> to convert the data into a dataframe and, ultimately, UPSERT a Hive/HBase
>> table with this data.
>>
>> Please let me know if you have any ideas.
>>
>> Thanks,
>> Ben
>>
>>
>


Re: JSON Usage

2016-04-17 Thread Hyukjin Kwon
Hi!

Personally, I don't think it necessarily needs to be a Dataset for your goal.

Just select your data under "s3" from the DataFrame loaded by
sqlContext.read.json().

You can call printSchema() to check the nested schema and then select the
data.

Also, I guess (from your code) you are trying to send a request, fetch the
response on the driver side, and then send each message to the executor side. I
guess there would be really heavy overhead on the driver side.
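
For example, something along these lines might be enough (just a sketch; the
path, column names and the one-document-per-line layout are assumptions):

import org.apache.spark.sql.functions.explode

// one S3 event notification document per line; the path is only an example
val df = sqlContext.read.json("s3n://my-bucket/notifications/")
df.printSchema()

// pull the bucket name and object key out of the nested "Records" array
val files = df
  .select(explode(df("Records")).as("record"))
  .select("record.s3.bucket.name", "record.s3.object.key")
files.show()

If the documents come from SQS instead, sqlContext.read.json() also accepts an
RDD[String] of message bodies.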
Holden,

If I were to use DataSets, then I would essentially do this:

val receiveMessageRequest = new ReceiveMessageRequest(myQueueUrl)
val messages = sqs.receiveMessage(receiveMessageRequest).getMessages()
for (message <- messages.asScala) {
val files = sqlContext.read.json(message.getBody())
}

Can I simply do files.toDS() or do I have to create a schema using a case
class File and apply it as[File]? If I have to apply a schema, then how
would I create it based on the JSON structure below, especially the nested
elements.

Thanks,
Ben


On Apr 14, 2016, at 3:46 PM, Holden Karau  wrote:

You could certainly use RDDs for that, you might also find using Dataset
selecting the fields you need to construct the URL to fetch and then using
the map function to be easier.

On Thu, Apr 14, 2016 at 12:01 PM, Benjamin Kim  wrote:

> I was wonder what would be the best way to use JSON in Spark/Scala. I need
> to lookup values of fields in a collection of records to form a URL and
> download that file at that location. I was thinking an RDD would be perfect
> for this. I just want to hear from others who might have more experience in
> this. Below is the actual JSON structure that I am trying to use for the S3
> bucket and key values of each “record" within “Records".
>
> {
>"Records":[
>   {
>  "eventVersion":"2.0",
>  "eventSource":"aws:s3",
>  "awsRegion":"us-east-1",
>  "eventTime":The time, in ISO-8601 format, for example,
> 1970-01-01T00:00:00.000Z, when S3 finished processing the request,
>  "eventName":"event-type",
>  "userIdentity":{
>
> "principalId":"Amazon-customer-ID-of-the-user-who-caused-the-event"
>  },
>  "requestParameters":{
> "sourceIPAddress":"ip-address-where-request-came-from"
>  },
>  "responseElements":{
> "x-amz-request-id":"Amazon S3 generated request ID",
> "x-amz-id-2":"Amazon S3 host that processed the request"
>  },
>  "s3":{
> "s3SchemaVersion":"1.0",
> "configurationId":"ID found in the bucket notification
> configuration",
> "bucket":{
>"name":"bucket-name",
>"ownerIdentity":{
>   "principalId":"Amazon-customer-ID-of-the-bucket-owner"
>},
>"arn":"bucket-ARN"
> },
> "object":{
>"key":"object-key",
>"size":object-size,
>"eTag":"object eTag",
>"versionId":"object version if bucket is
> versioning-enabled, otherwise null",
>"sequencer": "a string representation of a hexadecimal
> value used to determine event sequence,
>only used with PUTs and DELETEs"
> }
>  }
>   },
>   {
>   // Additional events
>   }
>]
> }
>
> Thanks
> Ben
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: WELCOME to user@spark.apache.org

2016-04-17 Thread Hyukjin Kwon
Hi Jinan,


There are some examples for XML here,
https://github.com/databricks/spark-xml/blob/master/src/test/java/com/databricks/spark/xml/JavaXmlSuite.java
for test codes.

Or, you can see documentation in README.md.
https://github.com/databricks/spark-xml#java-api.


There are other basic Java examples here,
https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples
.


Basic steps are explained well in a book, Learning Spark (you can just
google it).


I also see this is explained well in official document here,
http://spark.apache.org/docs/latest/programming-guide.html.


I hope this can help


Thanks!



2016-04-18 9:37 GMT+09:00 jinan_alhajjaj :

> Hello,
> I would like to know how to parse XML files using Apache spark by java
> language.  I am doing this for my senior project and I am a beginner in
> Apache Spark and I have just a little experience with spark.
>  Thank you.

Re: How does .jsonFile() work?

2016-04-19 Thread Hyukjin Kwon
Hi,

I hope I understood correctly. This is a simplified description of the procedure.

Precondition

 - The JSON file is written line by line; each line is a separate JSON document.
 - A root-level array is also supported,
eg.
[{...},
{...}
{...}]

Procedures

- Schema inference (if a user schema is not given)

 1. Read one line, i.e. one JSON document.
 2. Read it token by token with the Jackson parser (recursively if needed).
 3. Find (or infer), from each value, the appropriate Spark type (eg.
DateType, StringType and DecimalType).
 4. Step 3 returns a StructType for that single JSON document.
 5. Steps 1-4 produce an RDD[DataType].
 6. Aggregate each DataType (normally a StructType) into a single
StructType compatible across all the StructTypes.
 7. Use the aggregated StructType as the schema.

- Parse the JSON data.

 1. Read one line, i.e. one JSON document.
 2. Read it token by token with the Jackson parser (recursively if needed).
 3. Convert each value to the given type (from above).
 4. Step 3 returns a Row for that single JSON document.
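
A small illustration of the inference and merge steps (hypothetical data):

val lines = sc.parallelize(Seq(
  """{"a": 1, "b": "x"}""",
  """{"a": 2.5, "c": true}"""))

val df = sqlContext.read.json(lines)
df.printSchema()
// root
//  |-- a: double (nullable = true)   <- long and double merged into double
//  |-- b: string (nullable = true)
//  |-- c: boolean (nullable = true)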

Thanks!


2016-04-20 11:07 GMT+09:00 resonance :

> Hi, this is more of a theoretical question and I'm asking it here because I
> have no idea where to find documentation for this stuff.
>
> I am currently working with Spark SQL and am considering using data
> contained within JSON datasets. I am aware of the .jsonFile() method in
> Spark SQL.
>
> What is the general strategy used by Spark SQL .jsonFile() to parse/decode
> a
> JSON dataset?
>
> (For example, an answer I might be looking for is that the JSON file is
> read
> into an ETL pipeline and transformed into a predefined data structure.)
>
> I am deeply appreciative of any help provided.
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-does-jsonFile-work-tp26802.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: XML Data Source for Spark

2016-04-25 Thread Hyukjin Kwon
Hi Janan,

Sorry, I was sleeping. I guess you sent an email to me first and then asked on
the mailing list because I was not answering.

I just tested this to double-check and could produce the same exception
below:

java.lang.NoSuchMethodError:
scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
at com.databricks.spark.xml.XmlOptions.(XmlOptions.scala:26)
at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:48)
at
com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:58)
at
com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
at
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:125)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)



Try the artifact compiled for the other Scala version. There are two, as below:

Scala 2.10

groupId: com.databricks
artifactId: spark-xml_2.10
version: 0.3.3

Scala 2.11

groupId: com.databricks
artifactId: spark-xml_2.11
version: 0.3.3



2016-04-26 6:57 GMT+09:00 Michael Armbrust :

> You are using a version of the library that was compiled for a different
> version of Scala than the version of Spark that you are using.  Make sure
> that they match up.
>
> On Mon, Apr 25, 2016 at 5:19 PM, Mohamed ismail <
> mismai...@yahoo.com.invalid> wrote:
>
>> here is an example with code.
>> http://stackoverflow.com/questions/33078221/xml-processing-in-spark
>>
>> I haven't tried.
>>
>>
>> On Monday, April 25, 2016 1:06 PM, Jinan Alhajjaj <
>> j.r.alhaj...@hotmail.com> wrote:
>>
>>
>> Hi All,
>> I am trying to use XML data source that is used for parsing and querying
>> XML data with Apache Spark, for Spark SQL and data frames.I am using Apache
>> spark version 1.6.1 and I am using Java as a programming language.
>> I wrote this sample code :
>> SparkConf conf = new SparkConf().setAppName("parsing").setMaster("local");
>>
>> JavaSparkContext sc = new JavaSparkContext(conf);
>>
>> SQLContext sqlContext = new SQLContext(sc);
>>
>> DataFrame df =
>> sqlContext.read().format("com.databricks.spark.xml").option("rowtag",
>> "page").load("file.xml");
>>
>> When I run this code I faced a problem which is
>> Exception in thread "main" java.lang.NoSuchMethodError:
>> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
>> at com.databricks.spark.xml.XmlOptions.(XmlOptions.scala:26)
>> at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:48)
>> at
>> com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:58)
>> at
>> com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:44)
>> at
>> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>> at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:109)
>> at datbricxml.parsing.main(parsing.java:16).
>> Please, I need to solve this error for my senior project ASAP.
>>
>>
>>
>>
>


Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Could you maybe share your codes?
On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:

> Hi,
>
> I had loaded JSON file in parquet format into SparkSQL. I can't able to
> read List which is inside JSON.
>
> Sample JSON
>
> {
> "TOUR" : {
>  "CITIES" : ["Paris","Berlin","Prague"]
> },
> "BUDJET" : 100
> }
>
> I want to read value of CITIES.
>
> *Thanks*,
> 
>
>


Re: Spark SQL query for List

2016-04-26 Thread Hyukjin Kwon
Doesn't get(0) give you the Array[String] for CITY (am I missing something?)
On 26 Apr 2016 11:02 p.m., "Ramkumar V"  wrote:

JavaSparkContext ctx = new JavaSparkContext(sparkConf);

SQLContext sqlContext = new SQLContext(ctx);

DataFrame parquetFile = sqlContext.parquetFile(
"hdfs:/XYZ:8020/user/hdfs/parquet/*.parquet");

   parquetFile.registerTempTable("parquetFile");

DataFrame tempDF = sqlContext.sql("SELECT TOUR.CITIES, BUDJET from
parquetFile");

JavaRDDjRDD = tempDF.toJavaRDD();

 JavaRDD ones = jRDD.map(new Function() {

  public String call(Row row) throws Exception {

return row.getString(1);

  }

});

*Thanks*,
<https://in.linkedin.com/in/ramkumarcs31>


On Tue, Apr 26, 2016 at 3:48 PM, Hyukjin Kwon  wrote:

> Could you maybe share your codes?
> On 26 Apr 2016 9:51 p.m., "Ramkumar V"  wrote:
>
>> Hi,
>>
>> I had loaded JSON file in parquet format into SparkSQL. I can't able to
>> read List which is inside JSON.
>>
>> Sample JSON
>>
>> {
>> "TOUR" : {
>>  "CITIES" : ["Paris","Berlin","Prague"]
>> },
>> "BUDJET" : 100
>> }
>>
>> I want to read value of CITIES.
>>
>> *Thanks*,
>> <https://in.linkedin.com/in/ramkumarcs31>
>>
>>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the input is a single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>>
>>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
The wholeTextFiles() API uses WholeTextFileInputFormat,
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
which returns false for isSplitable. In this case, only a single mapper
handles each entire file, as far as I know.

And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the input is a single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>>
>>
>


Re: Is JavaSparkContext.wholeTextFiles distributed?

2016-04-26 Thread Hyukjin Kwon
EDIT: to be precise, not a mapper but a single task for the HadoopRDD, as far as I know.

I think the clearest way is just to run a job on multiple files with the
API and check the number of tasks in the job.
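
For example (a sketch; the path is hypothetical):

// each (path, content) pair becomes one record; check how the work is split
val rdd = sc.wholeTextFiles("s3n://my-bucket/input/*")
println(rdd.partitions.length)
rdd.count()  // then look at the number of tasks for this stage in the web UI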
On 27 Apr 2016 12:06 a.m., "Hyukjin Kwon"  wrote:

The wholeTextFiles() API uses WholeTextFileInputFormat,
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/input/WholeTextFileInputFormat.scala,
which returns false for isSplitable. In this case, only a single mapper
handles each entire file, as far as I know.

And also https://spark.apache.org/docs/1.6.0/programming-guide.html

If the input is a single file, then this would not be distributed.
On 26 Apr 2016 11:52 p.m., "Ted Yu"  wrote:

> Please take a look at:
> core/src/main/scala/org/apache/spark/SparkContext.scala
>
>* Do `val rdd = sparkContext.wholeTextFile("hdfs://a-hdfs-path")`,
>*
>*  then `rdd` contains
>* {{{
>*   (a-hdfs-path/part-0, its content)
>*   (a-hdfs-path/part-1, its content)
>*   ...
>*   (a-hdfs-path/part-n, its content)
>* }}}
> ...
>   * @param minPartitions A suggestion value of the minimal splitting
> number for input data.
>
>   def wholeTextFiles(
>   path: String,
>   minPartitions: Int = defaultMinPartitions): RDD[(String, String)] =
> withScope {
>
> On Tue, Apr 26, 2016 at 7:43 AM, Vadim Vararu 
> wrote:
>
>> Hi guys,
>>
>> I'm trying to read many filed from s3 using
>> JavaSparkContext.wholeTextFiles(...). Is that executed in a distributed
>> manner? Please give me a link to the place in documentation where it's
>> specified.
>>
>> Thanks, Vadim.
>>
>>
>>
>


Re: removing header from csv file

2016-04-26 Thread Hyukjin Kwon
There are two ways to do so.


Firstly, this way makes sure the header is cleanly skipped, but of course
the use of mapPartitionsWithIndex decreases performance a bit:

rdd.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else
iter }


Secondly, you can do

val header = rdd.first()
val data = rdd.filter(_ != header)

The second method does not guarantee that only the header is skipped,
because there might be data records that are exactly the same as the header.


CSV data source uses the second way so I gave a todo in the PR I recently
opened.
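
Putting both together (a sketch; the path is just an example):

val rdd = sc.textFile("/data/input.csv")

// 1) Drop the first line of the first partition only
val withoutHeader1 = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) iter.drop(1) else iter
}

// 2) Filter out lines equal to the first line
//    (this may also drop data lines that happen to match the header exactly)
val header = rdd.first()
val withoutHeader2 = rdd.filter(_ != header)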



2016-04-27 14:59 GMT+09:00 nihed mbarek :

> You can add a filter with string that you are sure available only in the
> header
>
>
> Le mercredi 27 avril 2016, Divya Gehlot  a
> écrit :
>
>> yes you can remove the headers by removing the first row
>>
>> can first() or head() to do that
>>
>>
>> Thanks,
>> Divya
>>
>> On 27 April 2016 at 13:24, Ashutosh Kumar 
>> wrote:
>>
>>> I see there is a library spark-csv which can be used for removing header
>>> and processing of csv files. But it seems it works with sqlcontext only. Is
>>> there a way to remove header from csv files without sqlcontext ?
>>>
>>> Thanks
>>> Ashutosh
>>>
>>
>>
>
> --
>
> M'BAREK Med Nihed,
> Fedora Ambassador, TUNISIA, Northern Africa
> http://www.nihed.com
>
> 
>
>
>


Re: Error in spark-xml

2016-04-30 Thread Hyukjin Kwon
Hi Sourav,

I think it is an issue. spark-xml assumes the element selected by rowTag is an object.

 Could you please open an issue at
https://github.com/databricks/spark-xml/issues?

Thanks!


2016-05-01 5:08 GMT+09:00 Sourav Mazumder :

> Hi,
>
> Looks like there is a problem in spark-xml if the xml has multiple
> attributes with no child element.
>
> For example say the xml has a nested object as below
> 
> bk_113
> bk_114
>  
>
> Now if I create a dataframe starting with rowtag bkval and then I do a
> select on that data frame it gives following error.
>
>
> scala.MatchError: ENDDOCUMENT (of class
> com.sun.xml.internal.stream.events.EndDocumentEvent) at
> com.databricks.spark.xml.parsers.StaxXmlParser$.checkEndElement(StaxXmlParser.scala:94)
> at
> com.databricks.spark.xml.parsers.StaxXmlParser$.com$databricks$spark$xml$parsers$StaxXmlParser$$convertObject(StaxXmlParser.scala:295)
> at
> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:58)
> at
> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:46)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at
> scala.collection.Iterator$class.foreach(Iterator.scala:727) at
> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at
> org.apache.spark.scheduler.Task.run(Task.scala:88) at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> However if there is only one row like below, it works fine.
>
> 
> bk_113
> 
>
> Any workaround ?
>
> Regards,
> Sourav
>
>


Re: Error in spark-xml

2016-05-01 Thread Hyukjin Kwon
To be more clear,

If you set the rowTag as "book", then it produces an exception, which is
the issue opened here: https://github.com/databricks/spark-xml/issues/92

Currently it does not support parsing a single element with only a value
as a row.


If you set the rowTag as "bkval", then it should work. I tested the case
below to double check.

If it does not work as below, please open an issue with some information so
that I can reproduce.


I tested the case above with the data below


  
bk_113
bk_114
  
  
bk_114
bk_116
  
  
bk_115
bk_116
  



I tested this with the codes below

val path = "path-to-file"
sqlContext.read
  .format("xml")
  .option("rowTag", "bkval")
  .load(path)
  .show()


Thanks!


2016-05-01 15:11 GMT+09:00 Hyukjin Kwon :

> Hi Sourav,
>
> I think it is an issue. XML will assume the element by the rowTag as
> object.
>
>  Could you please open an issue in
> https://github.com/databricks/spark-xml/issues please?
>
> Thanks!
>
>
> 2016-05-01 5:08 GMT+09:00 Sourav Mazumder :
>
>> Hi,
>>
>> Looks like there is a problem in spark-xml if the xml has multiple
>> attributes with no child element.
>>
>> For example say the xml has a nested object as below
>> 
>> bk_113
>> bk_114
>>  
>>
>> Now if I create a dataframe starting with rowtag bkval and then I do a
>> select on that data frame it gives following error.
>>
>>
>> scala.MatchError: ENDDOCUMENT (of class
>> com.sun.xml.internal.stream.events.EndDocumentEvent) at
>> com.databricks.spark.xml.parsers.StaxXmlParser$.checkEndElement(StaxXmlParser.scala:94)
>> at
>> com.databricks.spark.xml.parsers.StaxXmlParser$.com$databricks$spark$xml$parsers$StaxXmlParser$$convertObject(StaxXmlParser.scala:295)
>> at
>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:58)
>> at
>> com.databricks.spark.xml.parsers.StaxXmlParser$$anonfun$parse$1$$anonfun$apply$4.apply(StaxXmlParser.scala:46)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
>> scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) at
>> scala.collection.Iterator$class.foreach(Iterator.scala:727) at
>> scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>> at scala.collection.AbstractIterator.to(Iterator.scala:1157) at
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>> at
>> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:215)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1850)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at
>> org.apache.spark.scheduler.Task.run(Task.scala:88) at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> However if there is only one row like below, it works fine.
>>
>> 
>> bk_113
>> 
>>
>> Any workaround ?
>>
>> Regards,
>> Sourav
>>
>>
>


Re: Parse Json in Spark

2016-05-08 Thread Hyukjin Kwon
I remember this JIRA, https://issues.apache.org/jira/browse/SPARK-7366.
Parsing multi-line records is not supported in the JSON data source.

Instead this can be done by sc.wholeTextFiles(). I found some examples
here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

Although this reads each file as a single whole record, it should work.
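
A minimal sketch of that approach (the path is just an example, and each file
has to fit comfortably in memory):

// read each file as one (path, content) record and feed the content to the JSON reader
val wholeDocs = sc.wholeTextFiles("hdfs:///data/json/").map(_._2)
val df = sqlContext.read.json(wholeDocs)
df.printSchema()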

Thanks!
On 9 May 2016 7:20 a.m., "KhajaAsmath Mohammed" 
wrote:

> Hi,
>
> I am working on parsing the json in spark but most of the information
> available online states that  I need to have entire JSON in single line.
>
> In my case, Json file is delivered in complex structure and not in a
> single line. could anyone know how to process this in SPARK.
>
> I used Jackson jar to process json and was able to do it when it is
> present in single line. Any ideas?
>
> Thanks,
> Asmath
>


Re: XML Processing using Spark SQL

2016-05-12 Thread Hyukjin Kwon
Hi Arunkumar,


I guess your records are self-closing ones.

There is an issue open here,
https://github.com/databricks/spark-xml/issues/92

This is about XmlInputFormat.scala, and it seems a bit tricky to handle the
case, so I have left it open until now.


Thanks!


2016-05-13 5:03 GMT+09:00 Arunkumar Chandrasekar :

> Hello,
>
> Greetings.
>
> I'm trying to process a xml file exported from Health Kit application
> using Spark SQL for learning purpose. The sample record data is like the
> below:
>
>   sourceVersion="9.3" device="<, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3>"
> unit="count" creationDate="2016-04-23 19:31:33 +0530" startDate="2016-04-23
> 19:00:20 +0530" endDate="2016-04-23 19:01:41 +0530" value="31"/>
>
>   sourceVersion="9.3.1" device="<, name:iPhone,
> manufacturer:Apple, model:iPhone, hardware:iPhone7,2, software:9.3.1>"
> unit="count" creationDate="2016-04-24 05:45:00 +0530" startDate="2016-04-24
> 05:25:04 +0530" endDate="2016-04-24 05:25:24 +0530" value="10"/>.
>
> I want to have the column name of my table as the field value like type,
> sourceName, sourceVersion and the row entries as their respective values
> like HKQuantityTypeIdentifierStepCount, Vizhi, 9.3.1,..
>
> I took a look at the Spark-XML ,
> but didn't get any information in my case (my xml is not well formed with
> the tags). Is there any other option to convert the record that I have
> mentioned above into a schema format for playing with Spark SQL?
>
> Thanks in Advance.
>
> *Thank You*,
> Arun Chandrasekar
> chan.arunku...@gmail.com
>


Re: Does spark support Apache Arrow

2016-05-19 Thread Hyukjin Kwon
FYI, there is a JIRA for this,
https://issues.apache.org/jira/browse/SPARK-13534

I hope this link is helpful.

Thanks!


2016-05-20 11:18 GMT+09:00 Sun Rui :

> 1. I don’t think so
> 2. Arrow is for in-memory columnar execution. While cache is for in-memory
> columnar storage
>
> On May 20, 2016, at 10:16, Todd  wrote:
>
> From the official site http://arrow.apache.org/, Apache Arrow is used for
> Columnar In-Memory storage. I have two quick questions:
> 1. Does spark support Apache Arrow?
> 2. When dataframe is cached in memory, the data are saved in columnar
> in-memory style. What is the relationship between this feature and Apache
> Arrow,that is,
> when the data is in Apache Arrow format,does spark still need the effort
> to cache the dataframe in columnar in-memory?
>
>
>


Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Yea, I met this case before. I guess this is related to
https://issues.apache.org/jira/browse/SPARK-15393.

2016-06-15 8:46 GMT+09:00 antoniosi :

> I tried the following code in both Spark 1.5.1 and Spark 1.6.0:
>
> import org.apache.spark.sql.types.{
> StructType, StructField, StringType, IntegerType}
> import org.apache.spark.sql.Row
>
> val schema = StructType(
> StructField("k", StringType, true) ::
> StructField("v", IntegerType, false) :: Nil)
>
> val df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
> df.write.save("hdfs://xxx")
>
> Both 1.5.1 and 1.6.0 only save _SUCCESS file. It does not save any
> _metadata
> files. Also, in 1.6.0, it also gives the following error:
>
> 16/06/14 16:29:27 WARN ParquetOutputCommitter: could not write summary file
> for hdfs://xxx
> java.lang.NullPointerException
> at
>
> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
> at
>
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
> at
>
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
> at
>
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at
>
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
> at
>
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>
> I do not get this exception in 1.5.1 version though.
>
> I see this bug https://issues.apache.org/jira/browse/SPARK-15393, but this
> is for Spark 2.0. Is there a same bug in Spark 1.5.1 and 1.6?
>
> Is there a way we could save an empty dataframe properly?
>
> Thanks.
>
> Antonio.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Writing-empty-Dataframes-doesn-t-save-any-metadata-files-in-Spark-1-5-1-and-1-6-tp27169.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>


Re: Writing empty Dataframes doesn't save any _metadata files in Spark 1.5.1 and 1.6

2016-06-14 Thread Hyukjin Kwon
Oops, I just saw the link. It is not actually only for Spark 2.0.


To be clear, https://issues.apache.org/jira/browse/SPARK-15393 was a bit
different from your case (it was about writing an empty DataFrame with empty
partitions).

This was caused by https://github.com/apache/spark/pull/12855 and reverted.



I wrote your case in the comments in that JIRA.



2016-06-15 10:26 GMT+09:00 Hyukjin Kwon :

> Yea, I met this case before. I guess this is related with
> https://issues.apache.org/jira/browse/SPARK-15393.
>
> 2016-06-15 8:46 GMT+09:00 antoniosi :
>
>> I tried the following code in both Spark 1.5.1 and Spark 1.6.0:
>>
>> import org.apache.spark.sql.types.{
>> StructType, StructField, StringType, IntegerType}
>> import org.apache.spark.sql.Row
>>
>> val schema = StructType(
>> StructField("k", StringType, true) ::
>> StructField("v", IntegerType, false) :: Nil)
>>
>> val df = sqlContext.createDataFrame(sc.emptyRDD[Row], schema)
>> df.write.save("hdfs://xxx")
>>
>> Both 1.5.1 and 1.6.0 only save _SUCCESS file. It does not save any
>> _metadata
>> files. Also, in 1.6.0, it also gives the following error:
>>
>> 16/06/14 16:29:27 WARN ParquetOutputCommitter: could not write summary
>> file
>> for hdfs://xxx
>> java.lang.NullPointerException
>> at
>>
>> org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
>> at
>>
>> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
>> at
>>
>> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
>> at
>>
>> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
>> at
>>
>> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
>> at
>>
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
>> at
>>
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>> at
>>
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:108)
>>
>> I do not get this exception in 1.5.1 version though.
>>
>> I see this bug https://issues.apache.org/jira/browse/SPARK-15393, but
>> this
>> is for Spark 2.0. Is there a same bug in Spark 1.5.1 and 1.6?
>>
>> Is there a way we could save an empty dataframe properly?
>>
>> Thanks.
>>
>> Antonio.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Writing-empty-Dataframes-doesn-t-save-any-metadata-files-in-Spark-1-5-1-and-1-6-tp27169.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>>
>


Re: how to load compressed (gzip) csv file using spark-csv

2016-06-16 Thread Hyukjin Kwon
It will 'auto-detect' the compression codec from the file extension and then
decompress and read it correctly.
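
For example (the path is just an example):

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("hdfs:///data/input.csv.gz")  // decompressed transparently based on the extension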

Thanks!

2016-06-16 20:27 GMT+09:00 Vamsi Krishna :

> Hi,
>
> I'm using Spark 1.4.1 (HDP 2.3.2).
> As per the spark-csv documentation (
> https://github.com/databricks/spark-csv), I see that we can write to a
> csv file in compressed form using the 'codec' option.
> But, didn't see the support for 'codec' option to read a csv file.
>
> Is there a way to read a compressed (gzip) file using spark-csv?
>
> Thanks,
> Vamsi Attluri
> --
> Vamsi Attluri
>


Re: Processing json document

2016-07-06 Thread Hyukjin Kwon
There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried if it is a single large file.

In that case, this would only run on a single executor, which I think will
end up with an OutOfMemoryException.

Spark JSON data source does not support multi-line JSON as input due to the
limitation of TextInputFormat and LineRecordReader.

You may have to just extract the values yourself after reading the raw text (e.g. with textFile()).
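
As a rough sketch of that extraction (assuming each file fits in an executor's
memory and using the Jackson parser that is already on Spark's classpath; the
path and the added "id" field are only examples):

import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.databind.node.ObjectNode
import scala.collection.JavaConverters._

val raw = sc.wholeTextFiles("hdfs:///data/docs/").map(_._2)

// split the single top-level object into one JSON string per "idN" entry
val perRecord = raw.flatMap { doc =>
  val mapper = new ObjectMapper()
  mapper.readTree(doc).fields().asScala.map { entry =>
    val node = entry.getValue.asInstanceOf[ObjectNode]
    node.put("id", entry.getKey)  // keep the key so it is not lost
    mapper.writeValueAsString(node)
  }.toList
}

val df = sqlContext.read.json(perRecord)
df.printSchema()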


2016-07-07 14:48 GMT+09:00 Lan Jiang :

> Hi, there
>
> Spark has provided json document processing feature for a long time. In
> most examples I see, each line is a json object in the sample file. That is
> the easiest case. But how can we process a json document, which does not
> conform to this standard format (one line per json object)? Here is the
> document I am working on.
>
> First of all, it is multiple lines for one single big json object. The
> real file can be as long as 20+ G. Within that one single json object, it
> contains many name/value pairs. The name is some kind of id values. The
> value is the actual json object that I would like to be part of dataframe.
> Is there any way to do that? Appreciate any input.
>
>
> {
> "id1": {
> "Title":"title1",
> "Author":"Tom",
> "Source":{
> "Date":"20160506",
> "Type":"URL"
> },
> "Data":" blah blah"},
>
> "id2": {
> "Title":"title2",
> "Author":"John",
> "Source":{
> "Date":"20150923",
> "Type":"URL"
> },
> "Data":" blah blah "},
>
> "id3: {
> "Title":"title3",
> "Author":"John",
> "Source":{
> "Date":"20150902",
> "Type":"URL"
> },
> "Data":" blah blah "}
> }
>
>


Re: Processing json document

2016-07-06 Thread Hyukjin Kwon
The link uses the wholeTextFiles() API, which treats each file as a single record.


2016-07-07 15:42 GMT+09:00 Jörn Franke :

> This does not need necessarily the case if you look at the Hadoop
> FileInputFormat architecture then you can even split large multi line Jsons
> without issues. I would need to have a look at it, but one large file does
> not mean one Executor independent of the underlying format.
>
> On 07 Jul 2016, at 08:12, Hyukjin Kwon  wrote:
>
> There is a good link for this here,
> http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files
>
> If there are a lot of small files, then it would work pretty okay in a
> distributed manner, but I am worried if it is single large file.
>
> In this case, this would only work in single executor which I think will
> end up with OutOfMemoryException.
>
> Spark JSON data source does not support multi-line JSON as input due to
> the limitation of TextInputFormat and LineRecordReader.
>
> You may have to just extract the values after reading it by textFile..
>
>
> 2016-07-07 14:48 GMT+09:00 Lan Jiang :
>
>> Hi, there
>>
>> Spark has provided json document processing feature for a long time. In
>> most examples I see, each line is a json object in the sample file. That is
>> the easiest case. But how can we process a json document, which does not
>> conform to this standard format (one line per json object)? Here is the
>> document I am working on.
>>
>> First of all, it is multiple lines for one single big json object. The
>> real file can be as long as 20+ G. Within that one single json object, it
>> contains many name/value pairs. The name is some kind of id values. The
>> value is the actual json object that I would like to be part of dataframe.
>> Is there any way to do that? Appreciate any input.
>>
>>
>> {
>> "id1": {
>> "Title":"title1",
>> "Author":"Tom",
>> "Source":{
>> "Date":"20160506",
>> "Type":"URL"
>> },
>> "Data":" blah blah"},
>>
>> "id2": {
>> "Title":"title2",
>> "Author":"John",
>> "Source":{
>> "Date":"20150923",
>> "Type":"URL"
>> },
>> "Data":" blah blah "},
>>
>> "id3: {
>> "Title":"title3",
>> "Author":"John",
>> "Source":{
>> "Date":"20150902",
>> "Type":"URL"
>> },
>> "Data":" blah blah "}
>> }
>>
>>
>


RE: Processing json document

2016-07-07 Thread Hyukjin Kwon
Yea, I totally agree with Yong.

Anyway, this might not be a great idea, but you might want to take a look at
this,
http://pivotal-field-engineering.github.io/pmr-common/pmr/apidocs/com/gopivotal/mapreduce/lib/input/JsonInputFormat.html

This does not recognise nested structure, but I assume you might be able to
make it work by, for example, removing the first "{" and last "}" in your large
file and then loading it so that each id object in your data can be recognised
as a row, or by modifying the JsonInputFormat.

After that, you might be able to load this with the SparkContext.hadoopFile or
SparkContext.newAPIHadoopFile API as an RDD in which each row holds one
JSON document. And then there is the SQLContext.read.json API, which takes an
RDD where each row is a JSON document.

I know this is rough and not the best idea, but this is the only way I can
think of currently.
The problem is for the Hadoop input format to identify the record delimiter. If
the whole JSON record is in one line, then the natural record delimiter will
be the newline character.

Keep in mind that in a distributed file system, the file split position is most
likely NOT on a record delimiter. The input format implementation has to go
back or forward in the byte array looking for the next record delimiter, possibly on
another node.

Without a perfect record delimiter, you just have to parse the whole
file, since, as you know, the file boundary is a reliable record delimiter.

JSON is never a good format to store on a big data platform. If your
source JSON looks like this, then you have to preprocess it, or write your
own implementation to handle the record delimiter for your JSON data case.
But good luck with that. There is no perfect generic solution for any kind
of JSON data you want to handle.

Yong

--
From: ljia...@gmail.com
Date: Thu, 7 Jul 2016 11:57:26 -0500
Subject: Re: Processing json document
To: gurwls...@gmail.com
CC: jornfra...@gmail.com; user@spark.apache.org

Hi, there,

Thank you all for your input. @Hyukjin, as a matter of fact, I have read
the blog link you posted before asking the question on the forum. As you
pointed out, the link uses wholeTextFiles(), which is bad in my case,
because my json file can be as large as 20G+ and OOM might occur. I am not
sure how to extract the value by using textFile call as it will create an
RDD of string and treat each line without ordering. It destroys the json
context.

Large multi-line JSON files with a parent node are very common in the real
world. Take the common employees JSON example below: assuming we have
millions of employees and it is a super large JSON document, how can Spark
handle this? This should be a common pattern, shouldn't it? In the real world,
JSON documents do not always come as cleanly formatted as the Spark
examples require.

{
"employees":[
{
  "firstName":"John",
  "lastName":"Doe"
},
{
  "firstName":"Anna",
  "lastName":"Smith"
},
{
  "firstName":"Peter",
  "lastName":"Jones"
}
]
}



On Thu, Jul 7, 2016 at 1:47 AM, Hyukjin Kwon  wrote:

The link uses wholeTextFiles() API which treats each file as each record.


2016-07-07 15:42 GMT+09:00 Jörn Franke :

This does not need necessarily the case if you look at the Hadoop
FileInputFormat architecture then you can even split large multi line Jsons
without issues. I would need to have a look at it, but one large file does
not mean one Executor independent of the underlying format.

On 07 Jul 2016, at 08:12, Hyukjin Kwon  wrote:

There is a good link for this here,
http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files

If there are a lot of small files, then it would work pretty okay in a
distributed manner, but I am worried if it is single large file.

In this case, this would only work in single executor which I think will
end up with OutOfMemoryException.

Spark JSON data source does not support multi-line JSON as input due to the
limitation of TextInputFormat and LineRecordReader.

You may have to just extract the values after reading it by textFile..


2016-07-07 14:48 GMT+09:00 Lan Jiang :

Hi, there

Spark has provided json document processing feature for a long time. In
most examples I see, each line is a json object in the sample file. That is
the easiest case. But how can we process a json document, which does not
conform to this standard format (one line per json object)? Here is the
document I am working on.

First of all, it is multiple lines for one single big json object. The real
file can be as long as 20+ G. Within that one single json object, it
contains many name/value pairs. The name is some kind of id values. The
value is the actual json object that I would like to be part of dataframe.
Is there any way to do that? Appreciate any input.


{
"id1": {
"Title":"title1&qu

Re: Large files with wholetextfile()

2016-07-12 Thread Hyukjin Kwon
Otherwise, please consider using https://github.com/databricks/spark-xml.

Actually, there is a function to find the input file name, which is..

input_file_name function,
https://github.com/apache/spark/blob/5f342049cce9102fb62b4de2d8d8fa691c2e8ac4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L948

This is available from 1.6.0

Please refer to https://github.com/apache/spark/pull/13806 and
https://github.com/apache/spark/pull/13759
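
A sketch of the call pattern (whether the column is actually populated depends
on the Spark version and the underlying reader, as discussed in the PRs above;
the rowTag and path are just examples):

import org.apache.spark.sql.functions.input_file_name

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")
  .load("hdfs:///data/xml/*.xml")
  .withColumn("filename", input_file_name())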




2016-07-12 22:04 GMT+09:00 Prashant Sharma :

> Hi Baahu,
>
> That should not be a problem, given you allocate sufficient buffer for
> reading.
>
> I was just working on implementing a patch[1] to support the feature for
> reading wholetextfiles in SQL. This can actually be slightly better
> approach, because here we read to offheap memory for holding data(using
> unsafe interface).
>
> 1. https://github.com/apache/spark/pull/14151
>
> Thanks,
>
>
>
> --Prashant
>
>
> On Tue, Jul 12, 2016 at 6:24 PM, Bahubali Jain  wrote:
>
>> Hi,
>> We have a requirement where in we need to process set of xml files, each
>> of the xml files contain several records (eg:
>> 
>>  data of record 1..
>> 
>>
>> 
>> data of record 2..
>> 
>>
>> Expected output is   
>>
>> Since we needed file name as well in output ,we chose wholetextfile() .
>> We had to go against using StreamXmlRecordReader and StreamInputFormat
>> since I could not find a way to retreive the filename.
>>
>> These xml files could be pretty big, occasionally they could reach a size
>> of 1GB.Since contents of each file would be put into a single partition,would
>> such big files be a issue ?
>> The AWS cluster(50 Nodes) that we use is fairly strong , with each
>> machine having memory of around 60GB.
>>
>> Thanks,
>> Baahu
>>
>
>


Re: java.lang.RuntimeException: Unsupported type: vector

2016-07-24 Thread Hyukjin Kwon
I just wonder what your CSV data structure looks like.

If my understanding is correct, the SQL type of the VectorUDT is a StructType,
and the CSV data source does not support ArrayType and StructType.

Anyhow, it seems CSV does not support UDTs for now anyway.

https://github.com/apache/spark/blob/e1dc853737fc1739fbb5377ffe31fb2d89935b1f/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L241-L293
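
If the goal is simply to get a features vector for ML out of plain numeric CSV
columns, one common workaround (a sketch; the column names and path are only
examples) is to load the columns as primitive types and assemble the vector
afterwards:

import org.apache.spark.ml.feature.VectorAssembler

val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .option("inferSchema", "true")
  .load("data/tuple-data-file.csv")
  .toDF("label", "value")

val df = new VectorAssembler()
  .setInputCols(Array("value"))
  .setOutputCol("features")
  .transform(raw)

df.printSchema()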



2016-07-25 1:50 GMT+09:00 Jean Georges Perrin :

> I try to build a simple DataFrame that can be used for ML
>
>
> SparkConf conf = new SparkConf().setAppName("Simple prediction from Text
> File").setMaster("local");
> SparkContext sc = new SparkContext(conf);
> SQLContext sqlContext = new SQLContext(sc);
>
> sqlContext.udf().register("vectorBuilder", new VectorBuilder(), new
> VectorUDT());
>
> String filename = "data/tuple-data-file.csv";
> StructType schema = new StructType(
> new StructField[] { new StructField("C0", DataTypes.StringType, false,
> Metadata.empty()),
> new StructField("C1", DataTypes.IntegerType, false, Metadata.empty()),
> new StructField("features", new VectorUDT(), false, Metadata.empty()), });
>
> DataFrame df = sqlContext.read().format("com.databricks.spark.csv"
> ).schema(schema).option("header", "false")
> .load(filename);
> df = df.withColumn("label", df.col("C0")).drop("C0");
> df = df.withColumn("value", df.col("C1")).drop("C1");
> df.printSchema();
> Returns:
> root
>  |-- features: vector (nullable = false)
>  |-- label: string (nullable = false)
>  |-- value: integer (nullable = false)
> df.show();
> Returns:
>
> java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
> at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
> at scala.collection.AbstractIterator.to(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/07/24 12:46:01 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> localhost): java.lang.RuntimeException: Unsupported type: vector
> at com.databricks.spark.csv.util.TypeCast$.castTo(TypeCast.scala:76)
> at
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:194)
> at
> com.databricks.spark.csv.CsvRelation$$anonfun$buildScan$6.apply(CsvRelation.scala:173)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
> at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
> at scala.collection.Iterator$class.foreach(Iterator.scala:742)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
> at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
> at

Re: spark java - convert string to date

2016-07-31 Thread Hyukjin Kwon
I haven't used this myself, but I guess these functions should work.

unix_timestamp()


See
https://github.com/apache/spark/blob/480c870644595a71102be6597146d80b1c0816e4/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L2513-L2530
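
For example (a sketch; the column name and format are only examples):

import org.apache.spark.sql.functions.{col, unix_timestamp}

val df = sqlContext.createDataFrame(Seq(Tuple1("2016-07-31 22:57:00"))).toDF("date_str")
val parsed = df.withColumn("date",
  unix_timestamp(col("date_str"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
parsed.show()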



2016-07-31 22:57 GMT+09:00 Tony Lane :

> Any built in function in java with spark to convert string to date more
> efficiently
> or do we just use the standard java techniques
>
> -Tony
>


Re: DataFramesWriter saving DataFrames timestamp in weird format

2016-08-11 Thread Hyukjin Kwon
Do you mind if I ask which format you used to save the data?

I guess you used CSV and there is a related PR open here
https://github.com/apache/spark/pull/14279#issuecomment-237434591
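
If it is CSV, one workaround until something like that lands is to format the
timestamp as a string explicitly before writing (a sketch assuming the built-in
CSV source in Spark 2.0; the names, format and output path are only examples):

import java.sql.Timestamp
import org.apache.spark.sql.functions.date_format

val df = sqlContext.createDataFrame(Seq(Tuple1(new Timestamp(1470184585000L)))).toDF("time")

df.withColumn("time_str", date_format(df("time"), "yyyy-MM-dd HH:mm:ss"))
  .drop("time")
  .write
  .format("csv")
  .option("header", "true")
  .save("/tmp/out")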



2016-08-12 6:04 GMT+09:00 Jestin Ma :

> When I load in a timestamp column and try to save it immediately without
> any transformations, the output time is unix time with padded 0's until
> there are 16 values.
>
> For example,
> loading in a time of August 3, 2016, 00:36:25 GMT, which is 1470184585 in
> UNIX time, saves as 147018458500.
>
> When I do df.show(), it shows the date format that I pass in (custom
> format), but it saves as I mentioned.
> I tried loading the saved file as a timestamp and it expectedly throws an
> exception, not being able to recognize an invalid time.
>
> Are there any explanations / workarounds for this?
>
> Thank you,
> Jestin
>


Re: Flattening XML in a DataFrame

2016-08-12 Thread Hyukjin Kwon
Hi Sreekanth,

Assuming you are using Spark 1.x,

I believe this code below:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag",
"emp").load("/tmp/sample.xml")
  .selectExpr("manager.id", "manager.name",
"explode(manager.subordinates.clerk) as clerk")
  .selectExpr("id", "name", "clerk.cid", "clerk.cname")
  .show()

would print the results below as you want:

+---++---+-+
| id|name|cid|cname|
+---++---+-+
|  1| foo|  1|  foo|
|  1| foo|  1|  foo|
+---++---+-+


I hope this is helpful.

Thanks!




2016-08-13 9:33 GMT+09:00 Sreekanth Jella :

> Hi Folks,
>
>
>
> I am trying flatten variety of XMLs using DataFrames. I’m using spark-xml
> package which is automatically inferring my schema and creating a
> DataFrame.
>
>
>
> I do not want to hard code any column names in DataFrame as I have lot of
> varieties of XML documents and each might be lot more depth of child nodes.
> I simply want to flatten any type of XML and then write output data to a
> hive table. Can you please give some expert advice for the same.
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
>
> 1, foo, 2, cname2
>
> 1, foo, 3, cname3
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>


Re: Flattening XML in a DataFrame

2016-08-16 Thread Hyukjin Kwon
Sorry for the late reply.

Currently, the library only supports loading XML documents as they are.

Do you mind opening an issue with some more explanation here:
https://github.com/databricks/spark-xml/issues?
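
In the meantime, one rough direction for flattening generically, driven only
by the inferred schema, might look like the sketch below (the helper name
flatten is made up and I haven't tested it against every shape of document):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, explode}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Repeatedly expands the first struct column into its sub-fields and explodes
// the first array column, until only flat columns remain. Column names come
// from the inferred schema, so nothing is hard-coded.
def flatten(df: DataFrame): DataFrame = {
  df.schema.fields.find(f =>
    f.dataType.isInstanceOf[StructType] || f.dataType.isInstanceOf[ArrayType]) match {
    case None => df
    case Some(f) =>
      val others = df.schema.fieldNames.filterNot(_ == f.name).map(col)
      val expanded = f.dataType match {
        case st: StructType =>
          // pull each sub-field up, e.g. manager.id becomes manager_id
          val subCols = st.fieldNames.map(n => col(s"${f.name}.$n").as(s"${f.name}_$n"))
          df.select(others ++ subCols: _*)
        case _ =>
          // an array column: one output row per element
          df.select(others :+ explode(col(f.name)).as(f.name): _*)
      }
      flatten(expanded)
  }
}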




2016-08-17 7:22 GMT+09:00 Sreekanth Jella :

> Hi Experts,
>
>
>
> Please suggest. Thanks in advance.
>
>
>
> Thanks,
>
> Sreekanth
>
>
>
> *From:* Sreekanth Jella [mailto:srikanth.je...@gmail.com]
> *Sent:* Sunday, August 14, 2016 11:46 AM
> *To:* 'Hyukjin Kwon' 
> *Cc:* 'user @spark' 
> *Subject:* Re: Flattening XML in a DataFrame
>
>
>
> Hi Hyukjin Kwon,
>
> Thank you for reply.
>
> There are several types of XML documents with different schema which needs
> to be parsed and tag names do not know in hand. All we know is the XSD for
> the given XML.
>
> Is it possible to get the same results even when we do not know the xml
> tags like manager.id, manager.name or is it possible to read the tag
> names from XSD and use?
>
> Thanks,
> Sreekanth
>
>
>
> On Aug 12, 2016 9:58 PM, "Hyukjin Kwon"  wrote:
>
> Hi Sreekanth,
>
>
>
> Assuming you are using Spark 1.x,
>
>
>
> I believe this code below:
>
> sqlContext.read.format("com.databricks.spark.xml").option("rowTag", 
> "emp").load("/tmp/sample.xml")
>
>   .selectExpr("manager.id", "manager.name", 
> "explode(manager.subordinates.clerk) as clerk")
>
>   .selectExpr("id", "name", "clerk.cid", "clerk.cname")
>
>   .show()
>
> would print the results below as you want:
>
> +---++---+-+
>
> | id|name|cid|cname|
>
> +---++---+-+
>
> |  1| foo|  1|  foo|
>
> |  1| foo|  1|  foo|
>
> +---++---+-+
>
>
>
>
> I hope this is helpful.
>
>
>
> Thanks!
>
>
>
>
>
>
>
>
>
> 2016-08-13 9:33 GMT+09:00 Sreekanth Jella :
>
> Hi Folks,
>
>
>
> I am trying flatten variety of XMLs using DataFrames. I’m using spark-xml
> package which is automatically inferring my schema and creating a
> DataFrame.
>
>
>
> I do not want to hard code any column names in DataFrame as I have lot of
> varieties of XML documents and each might be lot more depth of child nodes.
> I simply want to flatten any type of XML and then write output data to a
> hive table. Can you please give some expert advice for the same.
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
>
>
> Expected output:
>
> id, name, clerk.cid, clerk.cname
>
> 1, foo, 2, cname2
>
> 1, foo, 3, cname3
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>
>


Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Hi Efe,

If my understanding is correct, writing and reading complex types is not
supported because the CSV format can't represent nested types in its own
format.

I guess the external CSV library allowing them to be written was rather a bug.

I think it'd be great if we could write such data as CSV and read it back,
but I guess we can't.

Thanks!

On 19 Aug 2016 6:33 a.m., "Efe Selcuk"  wrote:

> We have an application working in Spark 1.6. It uses the databricks csv
> library for the output format when writing out.
>
> I'm attempting an upgrade to Spark 2. When writing with both the native
> DataFrameWriter#csv() method and with first specifying the
> "com.databricks.spark.csv" format (I suspect underlying format is the same
> but I don't know how to verify), I get the following error:
>
> java.lang.UnsupportedOperationException: CSV data source does not support
> struct<[bunch of field names and types]> data type
>
> There are 20 fields, mostly plain strings with a couple of dates. The
> source object is a Dataset[T] where T is a case class with various fields
> The line just looks like: someDataset.write.csv(outputPath)
>
> Googling returned this fairly recent pull request: https://mail-archives
> .apache.org/mod_mbox/spark-commits/201605.mbox/%3C65d35a7
> 2bd05483392857098a2635...@git.apache.org%3E
>
> If I'm reading that correctly, the schema shows that each record has one
> field of this complex struct type? And the validation thinks it's something
> that it can't serialize. I would expect the schema to have a bunch of
> fields in it matching the case class, so maybe there's something I'm
> misunderstanding.
>
> Efe
>


Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, sorry, I should have read this carefully. Do you mind if I ask for your
code so I can test it?

I would like to reproduce.


I just tested this by myself but I couldn't reproduce it, as below (is this
what you're doing, right?):

case class ClassData(a: String, b: Date)

val ds: Dataset[ClassData] = Seq(
  ("a", Date.valueOf("1990-12-13")),
  ("a", Date.valueOf("1990-12-13")),
  ("a", Date.valueOf("1990-12-13"))
).toDF("a", "b").as[ClassData]
ds.write.csv("/tmp/data.csv")
spark.read.csv("/tmp/data.csv").show()

prints as below:

+---++
|_c0| _c1|
+---++
|  a|7651|
|  a|7651|
|  a|7651|
+---++

​

2016-08-19 9:27 GMT+09:00 Efe Selcuk :

> Thanks for the response. The problem with that thought is that I don't
> think I'm dealing with a complex nested type. It's just a dataset where
> every record is a case class with only simple types as fields, strings and
> dates. There's no nesting.
>
> That's what confuses me about how it's interpreting the schema. The schema
> seems to be one complex field rather than a bunch of simple fields.
>
> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon  wrote:
>
>> Hi Efe,
>>
>> If my understanding is correct, supporting to write/read complex types is
>> not supported because CSV format can't represent the nested types in its
>> own format.
>>
>> I guess supporting them in writing in external CSV is rather a bug.
>>
>> I think it'd be great if we can write and read back CSV in its own format
>> but I guess we can't.
>>
>> Thanks!
>>
>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk"  wrote:
>>
>>> We have an application working in Spark 1.6. It uses the databricks csv
>>> library for the output format when writing out.
>>>
>>> I'm attempting an upgrade to Spark 2. When writing with both the native
>>> DataFrameWriter#csv() method and with first specifying the
>>> "com.databricks.spark.csv" format (I suspect underlying format is the same
>>> but I don't know how to verify), I get the following error:
>>>
>>> java.lang.UnsupportedOperationException: CSV data source does not
>>> support struct<[bunch of field names and types]> data type
>>>
>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>> source object is a Dataset[T] where T is a case class with various fields
>>> The line just looks like: someDataset.write.csv(outputPath)
>>>
>>> Googling returned this fairly recent pull request: https://mail-
>>> archives.apache.org/mod_mbox/spark-commits/201605.mbox/%
>>> 3c65d35a72bd05483392857098a2635...@git.apache.org%3E
>>>
>>> If I'm reading that correctly, the schema shows that each record has one
>>> field of this complex struct type? And the validation thinks it's something
>>> that it can't serialize. I would expect the schema to have a bunch of
>>> fields in it matching the case class, so maybe there's something I'm
>>> misunderstanding.
>>>
>>> Efe
>>>
>>


Re: [Spark2] Error writing "complex" type to CSV

2016-08-18 Thread Hyukjin Kwon
Ah, BTW, there is an issue, SPARK-16216, about printing dates and
timestamps here. So please ignore the integer values for the dates.

2016-08-19 9:54 GMT+09:00 Hyukjin Kwon :

> Ah, sorry, I should have read this carefully. Do you mind if I ask your
> codes to test?
>
> I would like to reproduce.
>
>
> I just tested this by myself but I couldn't reproduce as below (is this
> what your doing, right?):
>
> case class ClassData(a: String, b: Date)
>
> val ds: Dataset[ClassData] = Seq(
>   ("a", Date.valueOf("1990-12-13")),
>   ("a", Date.valueOf("1990-12-13")),
>   ("a", Date.valueOf("1990-12-13"))
> ).toDF("a", "b").as[ClassData]
> ds.write.csv("/tmp/data.csv")
> spark.read.csv("/tmp/data.csv").show()
>
> prints as below:
>
> +---++
> |_c0| _c1|
> +---++
> |  a|7651|
> |  a|7651|
> |  a|7651|
> +---++
>
> ​
>
> 2016-08-19 9:27 GMT+09:00 Efe Selcuk :
>
>> Thanks for the response. The problem with that thought is that I don't
>> think I'm dealing with a complex nested type. It's just a dataset where
>> every record is a case class with only simple types as fields, strings and
>> dates. There's no nesting.
>>
>> That's what confuses me about how it's interpreting the schema. The
>> schema seems to be one complex field rather than a bunch of simple fields.
>>
>> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon  wrote:
>>
>>> Hi Efe,
>>>
>>> If my understanding is correct, supporting to write/read complex types
>>> is not supported because CSV format can't represent the nested types in its
>>> own format.
>>>
>>> I guess supporting them in writing in external CSV is rather a bug.
>>>
>>> I think it'd be great if we can write and read back CSV in its own
>>> format but I guess we can't.
>>>
>>> Thanks!
>>>
>>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk"  wrote:
>>>
>>>> We have an application working in Spark 1.6. It uses the databricks csv
>>>> library for the output format when writing out.
>>>>
>>>> I'm attempting an upgrade to Spark 2. When writing with both the native
>>>> DataFrameWriter#csv() method and with first specifying the
>>>> "com.databricks.spark.csv" format (I suspect underlying format is the same
>>>> but I don't know how to verify), I get the following error:
>>>>
>>>> java.lang.UnsupportedOperationException: CSV data source does not
>>>> support struct<[bunch of field names and types]> data type
>>>>
>>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>>> source object is a Dataset[T] where T is a case class with various fields
>>>> The line just looks like: someDataset.write.csv(outputPath)
>>>>
>>>> Googling returned this fairly recent pull request:
>>>> https://mail-archives.apache.org/mod_mbox/spark-
>>>> commits/201605.mbox/%3C65d35a72bd05483392857098a2635cc2@git.
>>>> apache.org%3E
>>>>
>>>> If I'm reading that correctly, the schema shows that each record has
>>>> one field of this complex struct type? And the validation thinks it's
>>>> something that it can't serialize. I would expect the schema to have a
>>>> bunch of fields in it matching the case class, so maybe there's something
>>>> I'm misunderstanding.
>>>>
>>>> Efe
>>>>
>>>
>


Re: what is the difference between coalesce() and repartition()? Re: trouble understanding data frame memory usage "java.io.IOException: Unable to acquire memory"

2015-12-28 Thread Hyukjin Kwon
Hi Andy,

This link explains the difference well.

https://bzhangusc.wordpress.com/2015/08/11/repartition-vs-coalesce/

Simply, the difference is whether it shuffles the data across partitions or not.

Actually, coalesce() with shuffle enabled performs exactly like repartition().
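
A quick sketch of the difference (the partition counts here are arbitrary):

val rdd = sc.parallelize(1 to 1000, 20)

rdd.coalesce(5)                  // narrow dependency: merges existing partitions, no shuffle
rdd.coalesce(5, shuffle = true)  // shuffles; this is what repartition(5) does underneath
rdd.repartition(5)               // always shuffles
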
On 29 Dec 2015 08:10, "Andy Davidson"  wrote:

> Hi Michael
>
> I’ll try 1.6 and report back.
>
> The java doc does not say much about coalesce() or repartition(). When I
> use reparation() just before I save my output everything runs as expected
>
> I though coalesce() is an optimized version of reparation() and should be
> used when ever we know we are reducing the number of partitions.
>
> Kind regards
>
> Andy
>
> From: Michael Armbrust 
> Date: Monday, December 28, 2015 at 2:41 PM
> To: Andrew Davidson 
> Cc: "user @spark" 
> Subject: Re: trouble understanding data frame memory usage
> "java.io.IOException: Unable to acquire memory"
>
> Unfortunately in 1.5 we didn't force operators to spill when ran out of
> memory so there is not a lot you can do.  It would be awesome if you could
> test with 1.6 and see if things are any better?
>
> On Mon, Dec 28, 2015 at 2:25 PM, Andy Davidson <
> a...@santacruzintegration.com> wrote:
>
>> I am using spark 1.5.1. I am running into some memory problems with a
>> java unit test. Yes I could fix it by setting –Xmx (its set to 1024M) how
>> ever I want to better understand what is going on so I can write better
>> code in the future. The test runs on a Mac, master="Local[2]"
>>
>> I have a java unit test that starts by reading a 672K ascii file. I my
>> output data file is 152K. Its seems strange that such a small amount of
>> data would cause an out of memory exception. I am running a pretty standard
>> machine learning process
>>
>>
>>1. Load data
>>2. create a ML pipeline
>>3. transform the data
>>4. Train a model
>>5. Make predictions
>>6. Join the predictions back to my original data set
>>7. Coalesce(1), I only have a small amount of data and want to save
>>it in a single file
>>8. Save final results back to disk
>>
>>
>> Step 7: I am unable to call Coalesce() “java.io.IOException: Unable to
>> acquire memory”
>>
>> To try and figure out what is going I put log messages in to count the
>> number of partitions
>>
>> Turns out I have 20 input files, each one winds up in a separate
>> partition. Okay so after loading I call coalesce(1) and check to make sure
>> I only have a single partition.
>>
>> The total number of observations is 1998.
>>
>> After calling step 7 I count the number of partitions and discovered I
>> have 224 partitions!. Surprising given I called Coalesce(1) before I
>> did anything with the data. My data set should easily fit in memory. When I
>> save them to disk I get 202 files created with 162 of them being empty!
>>
>> In general I am not explicitly using cache.
>>
>> Some of the data frames get registered as tables. I find it easier to use
>> sql.
>>
>> Some of the data frames get converted back to RDDs. I find it easier to
>> create RDD this way
>>
>> I put calls to unpersist(true). In several places
>>
>>private void memoryCheck(String name) {
>>
>> Runtime rt = Runtime.getRuntime();
>>
>> logger.warn("name: {} \t\ttotalMemory: {} \tfreeMemory: {}
>> df.size: {}",
>>
>> name,
>>
>> String.format("%,d", rt.totalMemory()),
>>
>> String.format("%,d", rt.freeMemory()));
>>
>> }
>>
>> Any idea how I can get a better understanding of what is going on? My
>> goal is to learn to write better spark code.
>>
>> Kind regards
>>
>> Andy
>>
>> Memory usages at various points in my unit test
>>
>> name: rawInput totalMemory:   447,741,952 freeMemory:   233,203,184
>>
>> name: naiveBayesModel totalMemory:   509,083,648 freeMemory:
>> 403,504,128
>>
>> name: lpRDD totalMemory:   509,083,648 freeMemory:   402,288,104
>>
>> name: results totalMemory:   509,083,648 freeMemory:   368,011,008
>>
>>
>>DataFrame exploreDF = results.select(results.col("id"),
>>
>> results.col("label"),
>>
>> results.col("binomialLabel"),
>>
>>
>> results.col("labelIndex"),
>>
>>
>> results.col("prediction"),
>>
>>
>> results.col("words"));
>>
>> exploreDF.show(10);
>>
>>
>>
>> Yes I realize its strange to switch styles how ever this should not cause
>> memory problems
>>
>>
>> final String exploreTable = "exploreTable";
>>
>> exploreDF.registerTempTable(exploreTable);
>>
>> String fmt = "SELECT * FROM %s where binomialLabel = ’signal'";
>>
>> String stmt = String.format(fmt, exploreTable);
>>
>>
>> DataFrame subsetToSave = sqlContext.sql(stmt);// .show(100);
>>
>>
>> name: subsetToSave totalMemory: 1,747,451,904 freeMemory: 1,049,447,144
>>
>>
>> exploreDF.unpersist(true

Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-28 Thread Hyukjin Kwon
Hi Divya,

Are you using, or have you tried, the Spark CSV datasource
(https://github.com/databricks/spark-csv)?
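
If so, and if your spark-csv version has the dateFormat option described in
its README, a rough sketch would be something like this (the column names and
the format string are made up):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", StringType, true),
  StructField("event_date", DateType, true)))

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")  // how the date column is formatted in the file
  .schema(schema)
  .load("/path/to/file.csv")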

Thanks!


2015-12-28 18:42 GMT+09:00 Divya Gehlot :

> Hi,
> I have input data set which is CSV file where I have date columns.
> My output will also be CSV file and will using this output CSV  file as
> for hive table creation.
> I have few queries :
> 1.I tried using custom schema using Timestamp but it is returning empty
> result set when querying the dataframes.
> 2.Can I use String datatype in Spark for date column and while creating
> table can define it as date type ? Partitioning of my hive table will be
> date column.
>
> Would really  appreciate if you share some sample code for timestamp in
> Dataframe whereas same can be used while creating the hive table.
>
>
>
> Thanks,
> Divya
>


Re: Timestamp datatype in dataframe + Spark 1.4.1

2015-12-29 Thread Hyukjin Kwon
44. lonadepodf: org.apache.spark.sql.DataFrame = [COLUMN1: string, 
> COLUMN2: string, COLUMN3: timestamp, COLUMN4: timestamp, COLUMN5: string, 
> COLUMN6: string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int, 
> COLUMN11: int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15: 
> string, COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19: 
> string, COLUMN20: string, COLUMN21: string, COLUMN22: string]
>45.
>46. scala> lonadepodf.select("COLUMN1").show(10)
>47. 15/12/28 03:38:01 INFO MemoryStore: ensureFreeSpace(216384) called 
> with curMem=0, maxMem=278302556
>48. 15/12/28 03:38:01 INFO MemoryStore: Block broadcast_0 stored as values 
> in memory (estimated size 211.3 KB, free 265.2 MB)
>49. 
> ...
>50. 15/12/28 03:38:07 INFO DAGScheduler: ResultStage 2 (show at 
> :33) finished in 0.653 s
>51. 15/12/28 03:38:07 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks 
> have all completed, from pool
>52. 15/12/28 03:38:07 INFO DAGScheduler: Job 2 finished: show at 
> :33, took 0.669388 s
>53. +---+
>54. |COLUMN1|
>55. +---+
>56. +---+
>
> Once Timestamp StructField is removed . Result set is returned
>
>
>1. scala> val loandepoSchema = StructType(Seq(
>2.  | StructField("COLUMN1", StringType, true),
>3.  | StructField("COLUMN2", StringType  , true),
>4.  | StructField("COLUMN3", StringType , true),
>5.  | StructField("COLUMN4", StringType , true),
>6.  | StructField("COLUMN5", StringType , true),
>7.  | StructField("COLUMN6", StringType, true),
>8.  | StructField("COLUMN7", IntegerType, true),
>9.  | StructField("COLUMN8", IntegerType, true),
>10.  | StructField("COLUMN9", StringType, true),
>11.  | StructField("COLUMN10", IntegerType, true),
>12.  | StructField("COLUMN11", IntegerType, true),
>13.  | StructField("COLUMN12", IntegerType, true),
>14.  | StructField("COLUMN13", StringType, true),
>15.  | StructField("COLUMN14", StringType, true),
>16.  | StructField("COLUMN15", StringType, true),
>17.  | StructField("COLUMN16", StringType, true),
>18.  | StructField("COLUMN17", StringType, true),
>19.  | StructField("COLUMN18", StringType, true),
>20.  | StructField("COLUMN19", StringType, true),
>21.  | StructField("COLUMN20", StringType, true),
>22.  | StructField("COLUMN21", StringType, true),
>23.  | StructField("COLUMN22", StringType, true)))
>24. loandepoSchema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(COLUMN1,StringType,true), 
> StructField(COLUMN2,StringType,true), StructField(COLUMN3,StringType,true), 
> StructField(COLUMN4,StringType,true), StructField(COLUMN5,StringType,true), 
> StructField(COLUMN6,StringType,true), StructField(COLUMN7,IntegerType,true), 
> StructField(COLUMN8,IntegerType,true), StructField(COLUMN9,StringType,true), 
> StructField(COLUMN10,IntegerType,true), 
> StructField(COLUMN11,IntegerType,true), 
> StructField(COLUMN12,IntegerType,true), 
> StructField(COLUMN13,StringType,true), StructField(COLUMN14,StringType,true), 
> StructField(COLUMN15,StringType,true), StructField(COLUMN16,StringType,true), 
> StructField(COLUMN17,StringType,true), StructField(COLUMN18,StringType,true), 
> StructField(COLUMN19,StringType,...
>25. scala> val lonadepodf = 
> hiveContext.read.format("com.databricks.spark.csv").option("header", 
> "true").schema(loandepoSchema).load("/tmp/TestDivya/loandepo_10K.csv")
>26. lonadepodf: org.apache.spark.sql.DataFrame = [COLUMN1: string, 
> COLUMN2: string, COLUMN3: string, COLUMN4: string, COLUMN5: string, COLUMN6: 
> string, COLUMN7: int, COLUMN8: int, COLUMN9: string, COLUMN10: int, COLUMN11: 
> int, COLUMN12: int, COLUMN13: string, COLUMN14: string, COLUMN15: string, 
> COLUMN16: string, COLUMN17: string, COLUMN18: string, COLUMN19: string, 
> COLUMN20: string, COLUMN21: string, COLUMN22: string]
>27.
>28. scala> lonadepodf.select("COLUMN1").show(10)
>29. 15/12/28 03:39:48 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 
> 172.31.20.85:40013 in memory (size: 4.2 KB, free: 265.3 MB)
>30.
>31. 15/12/28 03:39:49 INFO YarnScheduler: Removed TaskSet 6.0, whose tasks 
> have all completed, from pool
>32. 15/12/28 03:39:49 INFO D

Re: NA value handling in sparkR

2016-01-27 Thread Hyukjin Kwon
Hm.. as far as I remember, you can set the value to treat as null with the
*nullValue* option. I am hitting network issues with GitHub so I can't check
this right now, but please try that option as described in
https://github.com/databricks/spark-csv.

2016-01-28 0:55 GMT+09:00 Felix Cheung :

> That's correct - and because spark-csv as Spark package is not
> specifically aware of R's notion of  NA and interprets it as a string value.
>
> On the other hand, R native NA is converted to NULL on Spark when creating
> a Spark DataFrame from a R data.frame.
> https://eradiating.wordpress.com/2016/01/04/whats-new-in-sparkr-1-6-0/
>
>
>
> _
> From: Devesh Raj Singh 
> Sent: Wednesday, January 27, 2016 3:19 AM
> Subject: Re: NA value handling in sparkR
> To: Deborah Siegel 
> Cc: 
>
>
>
> Hi,
>
> While dealing with missing values with R and SparkR I observed the
> following. Please tell me if I am right or wrong?
>
>
> Missing values in native R are represented with a logical constant-NA.
> SparkR DataFrames represents missing values with NULL. If you use
> createDataFrame() to turn a local R data.frame into a distributed SparkR
> DataFrame, SparkR will automatically convert NA to NULL.
>
> However, if you are creating a SparkR
> DataFrame by reading in data from a file using read.df(), you may have
> strings of "NA", but not R logical constant NA missing value
> representations. String "NA" is not automatically converted to NULL.
>
> On Tue, Jan 26, 2016 at 2:07 AM, Deborah Siegel 
> wrote:
>
>> Maybe not ideal, but since read.df is inferring all columns from the csv
>> containing "NA" as type of strings, one could filter them rather than using
>> dropna().
>>
>> filtered_aq <- filter(aq, aq$Ozone != "NA" & aq$Solar_R != "NA")
>> head(filtered_aq)
>>
>> Perhaps it would be better to have an option for read.df to convert any
>> "NA" it encounters into null types, like createDataFrame does for , and
>> then one would be able to use dropna() etc.
>>
>>
>>
>> On Mon, Jan 25, 2016 at 3:24 AM, Devesh Raj Singh > > wrote:
>>
>>> Hi,
>>>
>>> Yes you are right.
>>>
>>> I think the problem is with reading of csv files. read.df is not
>>> considering NAs in the CSV file
>>>
>>> So what would be a workable solution in dealing with NAs in csv files?
>>>
>>>
>>>
>>> On Mon, Jan 25, 2016 at 2:31 PM, Deborah Siegel <
>>> deborah.sie...@gmail.com> wrote:
>>>
 Hi Devesh,

 I'm not certain why that's happening, and it looks like it doesn't
 happen if you use createDataFrame directly:
 aq <- createDataFrame(sqlContext,airquality)
 head(dropna(aq,how="any"))

 If I had to guess.. dropna(), I believe, drops null values. I suppose
 its possible that createDataFrame converts R's  values to null, so
 dropna() works with that. But perhaps read.df() does not convert R s to
 null, as those are most likely interpreted as strings when they come in
 from the csv. Just a guess, can anyone confirm?

 Deb






 On Sun, Jan 24, 2016 at 11:05 PM, Devesh Raj Singh <
 raj.deves...@gmail.com> wrote:

> Hi,
>
> I have applied the following code on airquality dataset available in R
> , which has some missing values. I want to omit the rows which has NAs
>
> library(SparkR) Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages"
> "com.databricks:spark-csv_2.10:1.2.0" "sparkr-shell"')
>
> sc <- sparkR.init("local",sparkHome =
> "/Users/devesh/Downloads/spark-1.5.1-bin-hadoop2.6")
>
> sqlContext <- sparkRSQL.init(sc)
>
> path<-"/Users/devesh/work/airquality/"
>
> aq <- read.df(sqlContext,path,source = "com.databricks.spark.csv",
> header="true", inferSchema="true")
>
> head(dropna(aq,how="any"))
>
> I am getting the output as
>
> Ozone Solar_R Wind Temp Month Day 1 41 190 7.4 67 5 1 2 36 118 8.0 72
> 5 2 3 12 149 12.6 74 5 3 4 18 313 11.5 62 5 4 5 NA NA 14.3 56 5 5 6 28
> NA 14.9 66 5 6
>
> The NAs still exist in the output. Am I missing something here?
>
> --
> Warm regards,
> Devesh.
>


>>>
>>>
>>> --
>>> Warm regards,
>>> Devesh.
>>>
>>
>>
>
>
> --
> Warm regards,
> Devesh.
>
>
>


Re: Reading lzo+index with spark-csv (Splittable reads)

2016-01-31 Thread Hyukjin Kwon
Hm.. As I said here
https://github.com/databricks/spark-csv/issues/245#issuecomment-177682354,
it sounds reasonable in a way, though to me this might be addressing rather
narrow use cases.

How about using csvRdd(),
https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/CsvParser.scala#L143-L162
?

I think you can do this like below:


val rdd = sc.newAPIHadoopFile("/file.csv.lzo",
classOf[com.hadoop.mapreduce.LzoTextInputFormat],
classOf[org.apache.hadoop.io.LongWritable],
classOf[org.apache.hadoop.io.Text])
val df = new CsvParser()
  .csvRdd(sqlContext, rdd)



2016-01-30 10:04 GMT+09:00 syepes :

> Well looking at the src it look like its not implemented:
>
>
> https://github.com/databricks/spark-csv/blob/master/src/main/scala/com/databricks/spark/csv/util/TextFile.scala#L34-L36
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Reading-lzo-index-with-spark-csv-Splittable-reads-tp26103p26105.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: spark-xml data source (com.databricks.spark.xml) not working with spark 1.6

2016-02-25 Thread Hyukjin Kwon
Hi,

It looks like you forgot to specify the "rowTag" option, which is "book" in
the case of the sample data.
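
For example, with the sample books.xml:

val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")

df.count()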

Thanks

2016-01-29 8:16 GMT+09:00 Andrés Ivaldi :

> Hi, could you get it work, tomorrow I'll be using the xml parser also, On
> windows 7, I'll let you know the results.
>
> Regards,
>
>
>
> On Thu, Jan 28, 2016 at 12:27 PM, Deenar Toraskar <
> deenar.toras...@gmail.com> wrote:
>
>> Hi
>>
>> Anyone tried using spark-xml with spark 1.6. I cannot even get the sample
>> book.xml file (wget
>> https://github.com/databricks/spark-xml/raw/master/src/test/resources/books.xml
>> ) working
>> https://github.com/databricks/spark-xml
>>
>> scala> val df =
>> sqlContext.read.format("com.databricks.spark.xml").load("books.xml")
>>
>>
>> scala> df.count
>>
>> res4: Long = 0
>>
>>
>> Anyone else facing the same issue?
>>
>>
>> Deenar
>>
>
>
>
> --
> Ing. Ivaldi Andres
>


Fixed writer version as version1 for Parquet when writing a Parquet file.

2015-10-08 Thread Hyukjin Kwon
Hi all,

While writing some Parquet files with Spark, I found it actually only writes
the Parquet files with writer version1.

This changes which encodings are used in the file.

Is this intentionally fixed for some reason?


I changed the code and tested writing as writer version2, and it looks fine.

In more detail, I found it fixes the writer version in
org.apache.spark.sql.execution.datasources.parquet.CatalystWriteSupport.scala:

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    ParquetProperties.WriterVersion.PARQUET_1_0.toString)
}


I changed this to this in order to keep the given configuration

def setSchema(schema: StructType, configuration: Configuration): Unit = {
  schema.map(_.name).foreach(CatalystSchemaConverter.checkFieldName)
  configuration.set(SPARK_ROW_SCHEMA, schema.json)
  configuration.set(
    ParquetOutputFormat.WRITER_VERSION,
    configuration.get(ParquetOutputFormat.WRITER_VERSION,
      ParquetProperties.WriterVersion.PARQUET_1_0.toString)
  )
}


and set the version to version2

sc.hadoopConfiguration.set(ParquetOutputFormat.WRITER_VERSION,
  ParquetProperties.WriterVersion.PARQUET_2_0.toString)



Filter applied on merged Parquet schemas with new column fails.

2015-10-27 Thread Hyukjin Kwon
When enabling mergeSchema together with a predicate filter, this fails since
Parquet filters are pushed down regardless of the schema of each split (or
rather, each file).

Dominic Ricard reported this issue (
https://issues.apache.org/jira/browse/SPARK-11103)

Even though this works okay when spark.sql.parquet.filterPushdown is set to
false, the default value of this is true. So this looks like an issue.
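
(For reference, the workaround is just the following; the path and the
filtered column are made-up examples.)

sqlContext.setConf("spark.sql.parquet.filterPushdown", "false")

val df = sqlContext.read
  .option("mergeSchema", "true")
  .parquet("/path/to/merged")

df.filter(df("added_column") > 0)  // a column that exists only in some part-files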

My questions are:
is this clearly an issue?
And if so, in which way should it be handled?


I think this is an issue, and I made three rough patches for it and tested
them, and they look fine.

The first approach looks simpler and appropriate, as I presume from previous
approaches such as
https://issues.apache.org/jira/browse/SPARK-11153.

However, in terms of safety and performance, I also want to confirm which one
would be a proper approach before trying to open a PR.

1. Simply set spark.sql.parquet.filterPushdown to false when mergeSchema is
used.

2. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all
the part-files (and also the merged one) and check whether each of them can
accept the given filter, and then push the filter down only when they all can
accept it, which I think is a bit over-implemented.

3. If spark.sql.parquet.filterPushdown is true, retrieve the schemas of all
the part-files (and also the merged one) and apply the filter only to each
split (rather, file) that can accept it, which (I think it's hacky) ends up
with different configurations for each task in a job.


Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Hi all,

I am writing this email to both user-group and dev-group since this is
applicable to both.

I am now working on Spark XML datasource (
https://github.com/databricks/spark-xml).
This uses an InputFormat implementation, which I downgraded to the Hadoop 1.x
API for version compatibility.

However, I found that the internal JSON datasource and the other Databricks
datasources use the Hadoop 2.x API, dealing with TaskAttemptContextImpl by
calling the method via reflection, because TaskAttemptContext is a class in
Hadoop 1.x and an interface in Hadoop 2.x.

So, I looked through the code for advantages of the Hadoop 2.x API but I
couldn't find any.
I wonder if there are some advantages to using the Hadoop 2.x API.

I understand that it is still preferable to use the Hadoop 2.x APIs, at least
for future differences, but somehow I feel it might not be worth using Hadoop
2.x if it requires reflecting on a method.

I would appreciate it if you could leave a comment here,
https://github.com/databricks/spark-xml/pull/14, or send back a reply if
there is a good explanation.

Thanks!


Re: Differences between Spark APIs for Hadoop 1.x and Hadoop 2.x in terms of performance, progress reporting and IO metrics.

2015-12-09 Thread Hyukjin Kwon
Thank you for your reply!

I have already made the change locally, so making the change itself would be fine.

I just wanted to be sure which way is correct.
On 9 Dec 2015 18:20, "Fengdong Yu"  wrote:

> I don’t think there is performance difference between 1.x API and 2.x API.
>
> but it’s not a big issue for your change, only
> com.databricks.hadoop.mapreduce.lib.input.XmlInputFormat.java
> <https://github.com/databricks/spark-xml/blob/master/src/main/java/com/databricks/hadoop/mapreduce/lib/input/XmlInputFormat.java>
>  need to change, right?
>
> It’s not a big change to 2.x API. if you agree, I can do, but I cannot
> promise the time within one or two weeks because of my daily job.
>
>
>
>
>
> On Dec 9, 2015, at 5:01 PM, Hyukjin Kwon  wrote:
>
> Hi all,
>
> I am writing this email to both user-group and dev-group since this is
> applicable to both.
>
> I am now working on Spark XML datasource (
> https://github.com/databricks/spark-xml).
> This uses a InputFormat implementation which I downgraded to Hadoop 1.x
> for version compatibility.
>
> However, I found all the internal JSON datasource and others in Databricks
> use Hadoop 2.x API dealing with TaskAttemptContextImpl by reflecting the
> method for this because TaskAttemptContext is a class in Hadoop 1.x and an
> interface in Hadoop 2.x.
>
> So, I looked through the codes for some advantages for Hadoop 2.x API but
> I couldn't.
> I wonder if there are some advantages for using Hadoop 2.x API.
>
> I understand that it is still preferable to use Hadoop 2.x APIs at least
> for future differences but somehow I feel like it might not have to use
> Hadoop 2.x by reflecting a method.
>
> I would appreciate that if you leave a comment here
> https://github.com/databricks/spark-xml/pull/14 as well as sending back a
> reply if there is a good explanation
>
> Thanks!
>
>
>


Inquiry about contributing code

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam,

I have a plan to contribute some code for passing filters down to a
datasource during physical planning.

In more detail, I understand that when we want to push filter operations down
to data like Parquet (actually filtering while reading the HDFS blocks rather
than filtering in memory with Spark operations), we need to implement

PrunedFilteredScan, PrunedScan or CatalystScan in package
org.apache.spark.sql.sources.



For PrunedFilteredScan and PrunedScan, Spark passes the filter objects in the
package org.apache.spark.sql.sources, which do not come directly from the
query parser but are built by selectFilters() in
org.apache.spark.sql.sources.DataSourceStrategy.

It looks like not all of the filters (or rather, raw expressions) are passed
to the function below in PrunedFilteredScan and PrunedScan.

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

The passing filters in here are defined in package
org.apache.spark.sql.sources.
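
To make the interface concrete, here is a minimal sketch of a relation that
receives such pushed-down filters (the schema and the in-memory data are made
up for illustration):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types._

class ExampleRelation(override val sqlContext: SQLContext)
  extends BaseRelation with PrunedFilteredScan {

  override def schema: StructType = StructType(Seq(
    StructField("field", IntegerType, nullable = true),
    StructField("value", StringType, nullable = true)))

  private val data = Seq(Row(1, "a"), Row(2, "b"))

  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
    // Only filters expressible as org.apache.spark.sql.sources.Filter arrive here,
    // e.g. EqualTo("field", 1); anything not pushed down is evaluated by Spark afterwards.
    val kept = data.filter { row =>
      filters.forall {
        case EqualTo("field", v) => row.getInt(0) == v
        case _ => true
      }
    }
    val indices = requiredColumns.map(schema.fieldNames.indexOf)
    sqlContext.sparkContext.parallelize(kept.map(r => Row.fromSeq(indices.map(r.get))))
  }
}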

On the other hand, it does not pass the EqualNullSafe filter (in the package
org.apache.spark.sql.catalyst.expressions), even though this looks possible
to push down to other datasources such as Parquet and JSON.



I understand that CatalystScan can take all the raw expressions coming from
the query planner. However, it is experimental and it also needs different
interfaces (as well as being unstable, for reasons such as binary
compatibility).

As far as I know, Parquet also does not use this.



In general, this can be an issue when a user sends a query to the data such as:

1.

SELECT *
FROM table
WHERE field = 1;


2.

SELECT *
FROM table
WHERE field <=> 1;


The second query can be hugely slow because of the large network traffic
caused by unfiltered data coming from the source RDD.



Also, I could not find a proper issue for this (except for
https://issues.apache.org/jira/browse/SPARK-8747, which says it now supports
binary capability).

Accordingly, I want to file this issue and make a pull request with my code.


Could you please give me any comments on this?

Thanks.


Re: Best way to read XML data from RDD

2016-08-21 Thread Hyukjin Kwon
Hi Diwakar,

The Spark XML library can take an RDD as the source.

```
val df = new XmlReader()
  .withRowTag("book")
  .xmlRdd(sqlContext, rdd)
```

If performance is critical, I would also recommend taking care of the
creation and destruction of the parser.

If the parser is not serializable, then you can do the creation once per
partition within mapPartitions, just like

https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9bb188140423/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L322-L325
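
i.e., roughly (using the JDK DOM parser only as an illustration; rdd is
assumed to be an RDD[String] of XML documents):

import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

rdd.mapPartitions { iter =>
  // build the (non-serializable) parser once per partition rather than once per record
  val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
  iter.map(xml => builder.parse(new ByteArrayInputStream(xml.getBytes("UTF-8"))))
}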

I hope this is helpful.



2016-08-20 15:10 GMT+09:00 Jörn Franke :

> I fear the issue is that this will create and destroy a XML parser object
> 2 mio times, which is very inefficient - it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally avro or sth like
> this?)?
> Otherwise you could just create one XML Parser object / node, but sharing
> this among the parallel tasks on the same node is tricky.
> The other possibility could be simply more hardware ...
>
> On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi 
> wrote:
>
> Yes . It accepts a xml file as source but not RDD. The XML data embedded
>  inside json is streamed from kafka cluster.  So I could get it as RDD.
> Right  now  I am using  spark.xml  XML.loadstring method inside  RDD map
> function  but  performance  wise I am not happy as it takes 4 minutes to
> parse XML from 2 million messages in a 3 nodes 100G 4 cpu each environment.
>
>
> Sent from Samsung Mobile.
>
>
>  Original message 
> From: Felix Cheung 
> Date:20/08/2016 09:49 (GMT+05:30)
> To: Diwakar Dhanuskodi , user <
> user@spark.apache.org>
> Cc:
> Subject: Re: Best way to read XML data from RDD
>
> Have you tried
>
> https://github.com/databricks/spark-xml
> ?
>
>
>
>
> On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <
> diwakar.dhanusk...@gmail.com> wrote:
>
> Hi,
>
> There is a RDD with json data. I could read json data using rdd.read.json
> . The json data has XML data in couple of key-value paris.
>
> Which is the best method to read and parse XML from rdd. Is there any
> specific xml libraries for spark. Could anyone help on this.
>
> Thanks.
>
>


Re: Entire XML data as one of the column in DataFrame

2016-08-21 Thread Hyukjin Kwon
I can't say this is the best way to do so but my instant thought is as
below:


Create two DataFrames:

sc.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</emplist>")
sc.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, "UTF-8")
val strXmlDf = sc.newAPIHadoopFile(path,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { pair =>
    new String(pair._2.getBytes, 0, pair._2.getLength)
  }.toDF("XML")

val xmlDf = sqlContext.read.format("xml")
  .option("rowTag", "emplist")
  .load(path)


Zip those two, maybe like this: https://github.com/apache/spark/pull/7474
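
or, as a rough sketch using zipWithIndex plus a join (this assumes both reads
return the records in the same order):

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

def withRowId(df: DataFrame): DataFrame = {
  val schema = StructType(df.schema.fields :+ StructField("row_id", LongType, nullable = false))
  df.sqlContext.createDataFrame(
    df.rdd.zipWithIndex.map { case (row, id) => Row.fromSeq(row.toSeq :+ id) },
    schema)
}

// joins the parsed columns with the raw "XML" string column by position
val joined = withRowId(xmlDf).join(withRowId(strXmlDf), "row_id").drop("row_id")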


and then start to filter with emp.id or emp.name.



2016-08-22 5:31 GMT+09:00 :

> Hello Experts,
>
>
>
> I’m using spark-xml package which is automatically inferring my schema and
> creating a DataFrame.
>
>
>
> I’m extracting few fields like id, name (which are unique) from below xml,
> but my requirement is to store entire XML in one of the column as well. I’m
> writing this data to AVRO hive table. Can anyone tell me how to achieve
> this?
>
>
>
> Example XML and expected output is given below.
>
>
>
> Sample XML:
>
> <emplist>
>   <emp>
>     <manager>
>       <id>1</id>
>       <name>foo</name>
>       <subordinates>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>         <clerk>
>           <cid>1</cid>
>           <cname>foo</cname>
>         </clerk>
>       </subordinates>
>     </manager>
>   </emp>
> </emplist>
>
>
>
> Expected output:
>
> id, name, XML
>
> 1, foo, <emplist>...</emplist>
>
>
>
> Thanks,
>
> Sreekanth Jella
>
>
>
>
>


Re: [Spark2] Error writing "complex" type to CSV

2016-08-22 Thread Hyukjin Kwon
Whether it writes the data as garbage or as a string representation, it
cannot be loaded back. So, I'd say both are wrong and both are bugs.

I think it'd be great if we could write such data as CSV and read it back in
its own format, but I guess we can't for now.
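
As a rough workaround sketch until then, you could expand or drop the nested
case-class column before writing ("debug" and "debug.note" below are made-up
names):

import org.apache.spark.sql.functions.col

someDataset.toDF()
  .withColumn("debug_note", col("debug.note"))  // keep the simple sub-fields you need
  .drop("debug")                                // remove the struct column CSV can't handle
  .write.csv(outputPath)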


2016-08-20 2:54 GMT+09:00 Efe Selcuk :

> Okay so this is partially PEBKAC. I just noticed that there's a debugging
> field at the end that's another case class with its own simple fields -
> *that's* the struct that was showing up in the error, not the entry
> itself.
>
> This raises a different question. What has changed that this is no longer
> possible? The pull request said that it prints garbage. Was that some
> regression in 2.0? The same code prints fine in 1.6.1. The field prints as
> an array of the values of its fields.
>
> On Thu, Aug 18, 2016 at 5:56 PM, Hyukjin Kwon  wrote:
>
>> Ah, BTW, there is an issue, SPARK-16216, about printing dates and
>> timestamps here. So please ignore the integer values for dates
>>
>> 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon :
>>
>>> Ah, sorry, I should have read this carefully. Do you mind if I ask your
>>> codes to test?
>>>
>>> I would like to reproduce.
>>>
>>>
>>> I just tested this by myself but I couldn't reproduce as below (is this
>>> what your doing, right?):
>>>
>>> case class ClassData(a: String, b: Date)
>>>
>>> val ds: Dataset[ClassData] = Seq(
>>>   ("a", Date.valueOf("1990-12-13")),
>>>   ("a", Date.valueOf("1990-12-13")),
>>>   ("a", Date.valueOf("1990-12-13"))
>>> ).toDF("a", "b").as[ClassData]
>>> ds.write.csv("/tmp/data.csv")
>>> spark.read.csv("/tmp/data.csv").show()
>>>
>>> prints as below:
>>>
>>> +---++
>>> |_c0| _c1|
>>> +---++
>>> |  a|7651|
>>> |  a|7651|
>>> |  a|7651|
>>> +---++
>>>
>>>
>>> 2016-08-19 9:27 GMT+09:00 Efe Selcuk :
>>>
>>>> Thanks for the response. The problem with that thought is that I don't
>>>> think I'm dealing with a complex nested type. It's just a dataset where
>>>> every record is a case class with only simple types as fields, strings and
>>>> dates. There's no nesting.
>>>>
>>>> That's what confuses me about how it's interpreting the schema. The
>>>> schema seems to be one complex field rather than a bunch of simple fields.
>>>>
>>>> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon  wrote:
>>>>
>>>>> Hi Efe,
>>>>>
>>>>> If my understanding is correct, supporting to write/read complex types
>>>>> is not supported because CSV format can't represent the nested types in 
>>>>> its
>>>>> own format.
>>>>>
>>>>> I guess supporting them in writing in external CSV is rather a bug.
>>>>>
>>>>> I think it'd be great if we can write and read back CSV in its own
>>>>> format but I guess we can't.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk"  wrote:
>>>>>
>>>>>> We have an application working in Spark 1.6. It uses the databricks
>>>>>> csv library for the output format when writing out.
>>>>>>
>>>>>> I'm attempting an upgrade to Spark 2. When writing with both the
>>>>>> native DataFrameWriter#csv() method and with first specifying the
>>>>>> "com.databricks.spark.csv" format (I suspect underlying format is the 
>>>>>> same
>>>>>> but I don't know how to verify), I get the following error:
>>>>>>
>>>>>> java.lang.UnsupportedOperationException: CSV data source does not
>>>>>> support struct<[bunch of field names and types]> data type
>>>>>>
>>>>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>>>>> source object is a Dataset[T] where T is a case class with various fields
>>>>>> The line just looks like: someDataset.write.csv(outputPath)
>>>>>>
>>>>>> Googling returned this fairly recent pull request:
>>>>>> https://mail-archives.apache.org/mod_mbox/spark-com
>>>>>> mits/201605.mbox/%3c65d35a72bd05483392857098a2635...@git.apa
>>>>>> che.org%3E
>>>>>>
>>>>>> If I'm reading that correctly, the schema shows that each record has
>>>>>> one field of this complex struct type? And the validation thinks it's
>>>>>> something that it can't serialize. I would expect the schema to have a
>>>>>> bunch of fields in it matching the case class, so maybe there's something
>>>>>> I'm misunderstanding.
>>>>>>
>>>>>> Efe
>>>>>>
>>>>>
>>>
>>
>


Re: Best way to read XML data from RDD

2016-08-22 Thread Hyukjin Kwon
Do you mind sharing your code and sample data? It should be okay with
single-line XML if I remember correctly.

2016-08-22 19:53 GMT+09:00 Diwakar Dhanuskodi 
:

> Hi Darin,
>
> Ate  you  using  this  utility  to  parse single line XML?
>
>
> Sent from Samsung Mobile.
>
>
>  Original message 
> From: Darin McBeath 
> Date:21/08/2016 17:44 (GMT+05:30)
> To: Hyukjin Kwon , Jörn Franke 
>
> Cc: Diwakar Dhanuskodi , Felix Cheung <
> felixcheun...@hotmail.com>, user 
> Subject: Re: Best way to read XML data from RDD
>
> Another option would be to look at spark-xml-utils.  We use this
> extensively in the manipulation of our XML content.
>
> https://github.com/elsevierlabs-os/spark-xml-utils
>
>
>
> There are quite a few examples.  Depending on your preference (and what
> you want to do), you could use xpath, xquery, or xslt to transform,
> extract, or filter.
>
> Like mentioned below, you want to initialize the parser in a mapPartitions
> call (one of the examples shows this).
>
> Hope this is helpful.
>
> Darin.
>
>
>
>
>
> 
> From: Hyukjin Kwon 
> To: Jörn Franke 
> Cc: Diwakar Dhanuskodi ; Felix Cheung <
> felixcheun...@hotmail.com>; user 
> Sent: Sunday, August 21, 2016 6:10 AM
> Subject: Re: Best way to read XML data from RDD
>
>
>
> Hi Diwakar,
>
> Spark XML library can take RDD as source.
>
> ```
> val df = new XmlReader()
>   .withRowTag("book")
>   .xmlRdd(sqlContext, rdd)
> ```
>
> If performance is critical, I would also recommend to take care of
> creation and destruction of the parser.
>
> If the parser is not serializble, then you can do the creation for each
> partition within mapPartition just like
>
> https://github.com/apache/spark/blob/ac84fb64dd85257da06f93a48fed9b
> b188140423/sql/core/src/main/scala/org/apache/spark/sql/
> DataFrameReader.scala#L322-L325
>
>
> I hope this is helpful.
>
>
>
>
> 2016-08-20 15:10 GMT+09:00 Jörn Franke :
>
> I fear the issue is that this will create and destroy a XML parser object
> 2 mio times, which is very inefficient - it does not really look like a
> parser performance issue. Can't you do something about the format choice?
> Ask your supplier to deliver another format (ideally avro or sth like
> this?)?
> >Otherwise you could just create one XML Parser object / node, but sharing
> this among the parallel tasks on the same node is tricky.
> >The other possibility could be simply more hardware ...
> >
> >On 20 Aug 2016, at 06:41, Diwakar Dhanuskodi <
> diwakar.dhanusk...@gmail.com> wrote:
> >
> >
> >Yes . It accepts a xml file as source but not RDD. The XML data embedded
> inside json is streamed from kafka cluster.  So I could get it as RDD.
> >>Right  now  I am using  spark.xml  XML.loadstring method inside  RDD map
> function  but  performance  wise I am not happy as it takes 4 minutes to
> parse XML from 2 million messages in a 3 nodes 100G 4 cpu each environment.
> >>
> >>
> >>
> >>
> >>Sent from Samsung Mobile.
> >>
> >>
> >> Original message 
> >>From: Felix Cheung 
> >>Date:20/08/2016  09:49  (GMT+05:30)
> >>To: Diwakar Dhanuskodi  , user <
> user@spark.apache.org>
> >>Cc:
> >>Subject: Re: Best way to read XML data from RDD
> >>
> >>
> >>Have you tried
> >>
> >>https://github.com/databricks/ spark-xml
> >>?
> >>
> >>
> >>
> >>
> >>
> >>On Fri, Aug 19, 2016 at 1:07 PM -0700, "Diwakar Dhanuskodi" <
> diwakar.dhanusk...@gmail.com> wrote:
> >>
> >>
> >>Hi,
> >>
> >>
> >>There is a RDD with json data. I could read json data using
> rdd.read.json . The json data has XML data in couple of key-value paris.
> >>
> >>
> >>Which is the best method to read and parse XML from rdd. Is there any
> specific xml libraries for spark. Could anyone help on this.
> >>
> >>
> >>Thanks.
>


Re: Spark 2.0 - Parquet data with fields containing periods "."

2016-08-31 Thread Hyukjin Kwon
Hi Don, I guess this should be fixed as of 2.0.1.

Please refer to this PR: https://github.com/apache/spark/pull/14339

On 1 Sep 2016 2:48 a.m., "Don Drake"  wrote:

> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark
> 2.0 and have encountered some interesting issues.
>
> First, it seems the SQL parsing is different, and I had to rewrite some
> SQL that was doing a mix of inner joins (using where syntax, not inner) and
> outer joins to get the SQL to work.  It was complaining about columns not
> existing.  I can't reproduce that one easily and can't share the SQL.  Just
> curious if anyone else is seeing this?
>
> I do have a showstopper problem with Parquet dataset that have fields
> containing a "." in the field name.  This data comes from an external
> provider (CSV) and we just pass through the field names.  This has worked
> flawlessly in Spark 1.5 and 1.6, but now spark can't seem to read these
> parquet files.
>
> I've reproduced a trivial example below. Jira created: https://issues.
> apache.org/jira/browse/SPARK-17341
>
>
> Spark context available as 'sc' (master = local[*], app id =
> local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>   /_/
>
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
> 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i *
> i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in
> metastore. hive.metastore.schema.verification is not enabled so recording
> the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default,
> returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value:
> int]
>
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further
> details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
> Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
> Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apac

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
Hi Selvam,

If your report lines are commented with some character (e.g. #), you can skip
those lines via the comment option [1].

If you are using Spark 1.x, then you might be able to do this by manually
skipping lines in the RDD and then turning it into a DataFrame as below:

I haven’t tested this but I think this should work.

val rdd = sparkContext.textFile("...")
val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (idx == 0) {
    iter.drop(10)
  } else {
    iter
  }
}
val df = new CsvParser().csvRdd(sqlContext, filteredRdd)

If you are using Spark 2.0, then it seems there is no way to manually modify
the source data first, because loading an existing RDD or Dataset[String] as
CSV into a DataFrame is not yet supported.

There is an issue open[2]. I hope this is helpful.

Thanks.

[1]
https://github.com/apache/spark/blob/27209252f09ff73c58e60c6df8aaba73b308088c/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L369
[2] https://issues.apache.org/jira/browse/SPARK-15463




On 10 Sep 2016 6:14 p.m., "Selvam Raman"  wrote:

> Hi,
>
> I am using spark csv to read csv file. The issue is my files first n lines
> contains some report and followed by actual data (header and rest of the
> data).
>
> So how can i skip first n lines in spark csv. I dont have any specific
> comment character in the first byte.
>
> Please give me some idea.
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: Spark CSV output

2016-09-10 Thread Hyukjin Kwon
Have you tried the quote-related options (e.g. `quote` or `quoteMode`, see
https://github.com/databricks/spark-csv/blob/master/README.md#features)?
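
For example, per that README, something along these lines:

df.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("quoteMode", "NONE")  // or MINIMAL / NON_NUMERIC / ALL, per the README
  .save("/path/to/output")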

On 11 Sep 2016 12:22 a.m., "ayan guha"  wrote:

> CSV standard uses quote to identify multiline output
> On 11 Sep 2016 01:04, "KhajaAsmath Mohammed" 
> wrote:
>
>> Hi,
>>
>> I am using the package com.databricks.spark.csv to save the dataframe
>> output to hdfs path. I am able to write the output but there are quotations
>> before and after end of the string. Did anyone resolve it when usinig it
>> with com.databricks.spark.csv package.
>>
>> "An account was successfully logged on.|NULL SID|-|-|0x0"
>>
>> Here is sample output that I got with quotations.
>>
>> Thanks,
>> Asmath.
>>
>


Re: Reading a TSV file

2016-09-10 Thread Hyukjin Kwon
Yeap. Also, sep is preferred and takes higher precedence than delimiter.
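
e.g., both of these should work in Spark 2.0:

val df1 = spark.read.option("sep", "\t").csv("/path/to/file.tsv")
val df2 = spark.read.option("header", "true").option("delimiter", "\t").csv("/path/to/file.tsv")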

2016-09-11 0:44 GMT+09:00 Jacek Laskowski :

> Hi Muhammad,
>
> sep or delimiter should both work fine.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
>
> On Sat, Sep 10, 2016 at 10:42 AM, Muhammad Asif Abbasi
>  wrote:
> > Thanks for responding. I believe i had already given scala example as a
> part
> > of my code in the second email.
> >
> > Just looked at the DataFrameReader code, and it appears the following
> would
> > work in Java.
> >
> > Dataset pricePaidDS = spark.read().option("sep","\t"
> ).csv(fileName);
> >
> > Thanks for your help.
> >
> > Cheers,
> >
> >
> >
> > On Sat, Sep 10, 2016 at 2:49 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> > wrote:
> >>
> >> Read header false not true
> >>
> >>  val df2 = spark.read.option("header",
> >> false).option("delimiter","\t").csv("hdfs://rhes564:9000/
> tmp/nw_10124772.tsv")
> >>
> >>
> >>
> >> Dr Mich Talebzadeh
> >>
> >>
> >>
> >> LinkedIn
> >> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCd
> OABUrV8Pw
> >>
> >>
> >>
> >> http://talebzadehmich.wordpress.com
> >>
> >>
> >> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >> loss, damage or destruction of data or any other property which may
> arise
> >> from relying on this email's technical content is explicitly
> disclaimed. The
> >> author will in no case be liable for any monetary damages arising from
> such
> >> loss, damage or destruction.
> >>
> >>
> >>
> >>
> >> On 10 September 2016 at 14:46, Mich Talebzadeh <
> mich.talebza...@gmail.com>
> >> wrote:
> >>>
> >>> This should be pretty straight forward?
> >>>
> >>> You can create a tab separated file from any database table and buck
> copy
> >>> out, MSSQL, Sybase etc
> >>>
> >>>  bcp scratchpad..nw_10124772 out nw_10124772.tsv -c -t '\t' -Usa
> -A16384
> >>> Password:
> >>> Starting copy...
> >>> 441 rows copied.
> >>>
> >>> more nw_10124772.tsv
> >>> Mar 22 2011 12:00:00:000AM  SBT 602424  10124772FUNDS
> >>> TRANSFER , FROM A/C 17904064  200.00  200.00
> >>> Mar 22 2011 12:00:00:000AM  SBT 602424  10124772FUNDS
> >>> TRANSFER , FROM A/C 36226823  454.74  654.74
> >>>
> >>> Put that file into hdfs. Note that it has no headers
> >>>
> >>> Read in as a tsv file
> >>>
> >>> scala> val df2 = spark.read.option("header",
> >>> true).option("delimiter","\t").csv("hdfs://rhes564:9000/tmp/
> nw_10124772.tsv")
> >>> df2: org.apache.spark.sql.DataFrame = [Mar 22 2011 12:00:00:000AM:
> >>> string, SBT: string ... 6 more fields]
> >>>
> >>> scala> df2.first
> >>> res7: org.apache.spark.sql.Row = [Mar 22 2011
> >>> 12:00:00:000AM,SBT,602424,10124772,FUNDS TRANSFER , FROM A/C
> >>> 17904064,200.00,,200.00]
> >>>
> >>> HTH
> >>>
> >>>
> >>> Dr Mich Talebzadeh
> >>>
> >>>
> >>>
> >>> LinkedIn
> >>> https://www.linkedin.com/profile/view?id=
> AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> >>>
> >>>
> >>>
> >>> http://talebzadehmich.wordpress.com
> >>>
> >>>
> >>> Disclaimer: Use it at your own risk. Any and all responsibility for any
> >>> loss, damage or destruction of data or any other property which may
> arise
> >>> from relying on this email's technical content is explicitly
> disclaimed. The
> >>> author will in no case be liable for any monetary damages arising from
> such
> >>> loss, damage or destruction.
> >>>
> >>>
> >>>
> >>>
> >>> On 10 September 2016 at 13:57, Mich Talebzadeh
> >>>  wrote:
> 
>  Thanks Jacek.
> 
>  The old stuff with databricks
> 
>  scala> val df =
>  spark.read.format("com.databricks.spark.csv").option("inferSchema",
>  "true").option("header",
>  "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>  df: org.apache.spark.sql.DataFrame = [Transaction Date: string,
>  Transaction Type: string ... 7 more fields]
> 
>  Now I can do
> 
>  scala> val df2 = spark.read.option("header",
>  true).csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
>  df2: org.apache.spark.sql.DataFrame = [Transaction Date: string,
>  Transaction Type: string ... 7 more fields]
> 
>  About Schema stuff that apparently Spark works out itself
> 
>  scala> df.printSchema
>  root
>   |-- Transaction Date: string (nullable = true)
>   |-- Transaction Type: string (nullable = true)
>   |-- Sort Code: string (nullable = true)
>   |-- Account Number: integer (nullable = true)
>   |-- Transaction Description: string (nullable = true)
>   |-- Debit Amount: double (nullable = true)
>   |-- Credit Amount: double (nullable = true)
>   |-- Balance: double (nullable = true)
>   |-- _c8: string (nullable = true)
> 
>  scala> df2.printSchema
>  root
>   |-- Transaction Date: string (

Re: Spark CSV skip lines

2016-09-10 Thread Hyukjin Kwon
As you are reading each file as a single record via wholeTextFiles and
flattening it into records, I think you can just drop those first few lines
as you want.

Can you just drop or skip the first few lines in reader.readAll().map(...)?

Also, are you sure this is an issue in Spark rather than in the external CSV
library?

Do you mind sharing the stack trace if you think it is?
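For example, a minimal sketch of that, reusing the names from your snippet
below (just an illustration: the opencsv import path depends on your version,
here I assume the older au.com.bytecode one, and the length check is an
assumption to guard against the empty/report lines that caused the
ArrayIndexOutOfBoundsException you mentioned):

import java.io.StringReader
import scala.collection.JavaConverters._
import au.com.bytecode.opencsv.CSVReader
import org.apache.spark.sql.Row

val test = wholeTextFiles.flatMap { case (_, txt) =>
  val reader = new CSVReader(new StringReader(txt))
  reader.readAll().asScala
    .drop(2)                // skip the two report lines at the top of each file
    .filter(_.length > 14)  // guard against empty/short lines
    .map(data => Row(data(3), data(4), data(7), data(9), data(14)))
}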

On 11 Sep 2016 1:50 a.m., "Selvam Raman"  wrote:

> Hi,
>
> I saw this two option already anyway thanks for the idea.
>
> i am using wholetext file to read my data(cause there are  \n middle of
> it) and using opencsv to parse the data. In my data first two lines are
> just some report. how can i eliminate.
>
> *How to eliminate first two lines after reading from wholetextfiles.*
>
> val test = wholeTextFiles.flatMap{ case (_, txt) =>
>  | val reader = new CSVReader(new StringReader(txt));
>  | reader.readAll().map(data => Row(data(3),data(4),data(7),
> data(9),data(14)))}
>
> The above code throws arrayoutofbounce exception for empty line and report
> line.
>
>
> On Sat, Sep 10, 2016 at 3:02 PM, Hyukjin Kwon  wrote:
>
>> Hi Selvam,
>>
>> If your report is commented with any character (e.g. #), you can skip
>> these lines via comment option [1].
>>
>> If you are using Spark 1.x, then you might be able to do this by manually
>> skipping from the RDD and then making this to DataFrame as below:
>>
>> I haven’t tested this but I think this should work.
>>
>> val rdd = sparkContext.textFile("...")
>> val filteredRdd = rdd.mapPartitionsWithIndex { (idx, iter) =>
>>   if (idx == 0) {
>> iter.drop(10)
>>   } else {
>> iter
>>   }
>> }
>> val df = new CsvParser().csvRdd(sqlContext, filteredRdd)
>>
>> If you are using Spark 2.0, then it seems there is no way to manually
>> modifying the source data because loading existing RDD or DataSet[String]
>> to DataFrame is not yet supported.
>>
>> There is an issue open[2]. I hope this is helpful.
>>
>> Thanks.
>>
>> [1] https://github.com/apache/spark/blob/27209252f09ff73c58e
>> 60c6df8aaba73b308088c/sql/core/src/main/scala/org/
>> apache/spark/sql/DataFrameReader.scala#L369
>> [2] https://issues.apache.org/jira/browse/SPARK-15463
>>
>>
>> ​
>>
>>
>> On 10 Sep 2016 6:14 p.m., "Selvam Raman"  wrote:
>>
>>> Hi,
>>>
>>> I am using spark csv to read csv file. The issue is my files first n
>>> lines contains some report and followed by actual data (header and rest of
>>> the data).
>>>
>>> So how can i skip first n lines in spark csv. I dont have any specific
>>> comment character in the first byte.
>>>
>>> Please give me some idea.
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>
>
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Re: take() works on RDD but .write.json() does not work in 2.0.0

2016-09-17 Thread Hyukjin Kwon
Hi Kevin,

I have few questions on this.

Does this fail only with write.json()? I just wonder whether write.text,
write.csv or another API fails as well, or whether it is a JSON-specific
issue.

Also, does it work with small data? I want to make sure whether this happens
only on large data.
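For example, a quick way to narrow it down (just a sketch; I use myrdd from
your message and assume it is actually a DataFrame/Dataset since .write is
used, and the output paths are hypothetical):

println(myrdd.count())                                // still non-empty at write time?
myrdd.limit(100).write.json("/tmp/debug_json")        // small data, JSON
myrdd.limit(100).write.parquet("/tmp/debug_parquet")  // small data, another format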

Thanks!



2016-09-18 6:42 GMT+09:00 Kevin Burton :

> I'm seeing some weird behavior and wanted some feedback.
>
> I have a fairly large, multi-hour job that operates over about 5TB of data.
>
> It builds it out into a ranked category index of about 25000 categories
> sorted by rank, descending.
>
> I want to write this to a file but it's not actually writing any data.
>
> if I run myrdd.take(100) ... that works fine and prints data to a file.
>
> If I run
>
> myrdd.write.json(), it takes the same amount of time, and then writes a
> local file with a SUCCESS file but no actual partition data in the file.
> There's only one small file with SUCCESS.
>
> Any advice on how to debug this?
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


How many are there PySpark Windows users?

2016-09-17 Thread Hyukjin Kwon
Hi all,

We are currently testing SparkR on Windows [1], and several problems are
being identified from time to time. Automating Spark's Scala tests on Windows
does not seem easy, because we would first need proper change detection to
run only the related tests rather than the whole suite. However, I think it
is feasible to automate PySpark tests on Windows because they do not take so
long.

I believe there are quite a few data analysts using SparkR on Windows, but I
am not sure how many people are using PySpark on Windows.

Currently, there are some issues with the SparkR testing automation (e.g. it
triggers some builds/tests on PRs not related to SparkR), so we might have to
try this later. I just want to know and make sure before my future PR and
before looking into this further, and then I would like to add this thread as
a reference.

I would appreciate it if any of you could share your opinions, experiences or
issues with PySpark on Windows.

Thank you very much.

[1] https://github.com/apache/spark/pull/14859


Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It does not seem to be an issue in Spark. Does "CSVParser" work fine with
the data when used without Spark?
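For example, a quick check in a plain Scala shell, no Spark involved (an
assumption here is that this column ends up being converted to an integer
somewhere in parseLineToTuple):

"0.00000".toDouble   // fine: 0.0
"0.00000".toInt      // throws java.lang.NumberFormatException: For input string: "0.00000"
// If an integer is really needed, go through a double first: "0.00000".toDouble.toInt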

On 20 Sep 2016 2:15 a.m., "Mohamed ismail" 
wrote:

> Hi all
>
> I am trying to read:
>
> sc.textFile(DataFile).mapPartitions(lines => {
> val parser = new CSVParser(",")
> lines.map(line=>parseLineToTuple(line, parser))
> })
> Data looks like:
> android phone,0,0,0,,0,0,0,0,0,0,0,5,0,0,0,5,0,0.0,0.0,0.
> 0,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
> ios phone,0,-1,0,,0,0,0,0,0,0,1,0,0,0,0,1,0,0.0,0.0,0.
> 0,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 23055.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 23055.0 (TID 191607, ):
> java.lang.NumberFormatException: For input string: "0.0"
>
> Has anyone faced such issues. Is there a solution?
>
> Thanks,
> Mohamed
>
>


Re: NumberFormatException: For input string: "0.00000"

2016-09-19 Thread Hyukjin Kwon
It does not seem to be an issue in Spark. Does "CSVParser" work fine with
the data when used without Spark?

BTW, it seems there is something wrong with your email address. I am
sending this again.

On 20 Sep 2016 8:32 a.m., "Hyukjin Kwon"  wrote:

> It does not seem to be an issue in Spark. Does "CSVParser" work fine with
> the data when used without Spark?
>
> On 20 Sep 2016 2:15 a.m., "Mohamed ismail" 
> wrote:
>
>> Hi all
>>
>> I am trying to read:
>>
>> sc.textFile(DataFile).mapPartitions(lines => {
>> val parser = new CSVParser(",")
>> lines.map(line=>parseLineToTuple(line, parser))
>> })
>> Data looks like:
>> android phone,0,0,0,,0,0,0,0,0,0,0,5,0,0,0,5,0,0.0,0.0,0.000
>> 00,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>> ios phone,0,-1,0,,0,0,0,0,0,0,1,0,0,0,0,1,0,0.0,0.0,0.00
>> 000,0.0,0.0,0,0,0,0,0,0,0,0.0,0,0,0
>>
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 1 in stage 23055.0 failed 4 times, most recent failure: Lost task 1.3 in
>> stage 23055.0 (TID 191607, ):
>> java.lang.NumberFormatException: For input string: "0.0"
>>
>> Has anyone faced such issues. Is there a solution?
>>
>> Thanks,
>> Mohamed
>>
>>


Re: Issue with rogue data in csv file used in Spark application

2016-09-27 Thread Hyukjin Kwon
Hi Mich,

I guess you could use the nullValue option so that those values are read as
null.

If you are reading them in as strings in the first place, then you would hit
https://github.com/apache/spark/pull/14118 first, which is resolved as of
2.0.1.

Unfortunately, this bug also exists in the external CSV library for strings,
if I recall correctly.

However, it should be fine if you set the schema explicitly when you load, as
this bug does not exist for floats at least.

I hope this is helpful.

Thanks!
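For example, a sketch along those lines, using the columns from your case
class (the assumption here is that "-" is the only marker used for missing
values):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("Stock", StringType), StructField("Ticker", StringType),
  StructField("TradeDate", StringType), StructField("Open", FloatType),
  StructField("High", FloatType), StructField("Low", FloatType),
  StructField("Close", FloatType), StructField("Volume", IntegerType)))

val df1 = spark.read
  .option("header", true)
  .option("nullValue", "-")   // "-" becomes null in the Float columns
  .schema(schema)
  .csv(location)

Note that the rogue rows then come through with nulls in the price columns,
so any later conversion needs to handle them (or drop them, e.g. with
df1.na.drop()).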

On 28 Sep 2016 7:06 a.m., "Mich Talebzadeh" 
wrote:

> Thanks guys
>
> Actually these are the 7 rogue rows. The column 0 is the Volume column
> which means there was no trades on those days
>
>
> *cat stock.csv|grep ",0"*SAP SE,SAP, 23-Dec-11,-,-,-,40.56,0
> SAP SE,SAP, 21-Apr-11,-,-,-,45.85,0
> SAP SE,SAP, 30-Dec-10,-,-,-,38.10,0
> SAP SE,SAP, 23-Dec-10,-,-,-,38.36,0
> SAP SE,SAP, 30-Apr-08,-,-,-,32.39,0
> SAP SE,SAP, 29-Apr-08,-,-,-,33.05,0
> SAP SE,SAP, 28-Apr-08,-,-,-,32.60,0
>
> So one way would be to exclude the rows that there was no volume of trade
> that day when cleaning up the csv file
>
> *cat stock.csv|grep -v **",0"*
>
> and that works. Bearing in mind that putting 0s in place of "-" will skew
> the price plot.
>
> BTW I am using Spark csv as well
>
> val df1 = spark.read.option("header", true).csv(location)
>
> This is the class and the mapping
>
>
> case class columns(Stock: String, Ticker: String, TradeDate: String, Open:
> Float, High: Float, Low: Float, Close: Float, Volume: Integer)
> val df2 = df1.map(p => columns(p(0).toString, p(1).toString,
> p(2).toString, p(3).toString.toFloat, p(4).toString.toFloat,
> p(5).toString.toFloat, p(6).toString.toFloat, p(7).toString.toInt))
>
>
> In here I have
>
> p(3).toString.toFloat
>
> How can one check for rogue data in p(3)?
>
>
> Thanks
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 27 September 2016 at 21:49, Mich Talebzadeh 
> wrote:
>
>>
>> I have historical prices for various stocks.
>>
>> Each csv file has 10 years trade one row per each day.
>>
>> These are the columns defined in the class
>>
>> case class columns(Stock: String, Ticker: String, TradeDate: String,
>> Open: Float, High: Float, Low: Float, Close: Float, Volume: Integer)
>>
>> The issue is with Open, High, Low, Close columns that all are defined as
>> Float.
>>
>> Most rows are OK like below but the red one with "-" defined as Float
>> causes issues
>>
>>   Date Open High  Low   Close Volume
>> 27-Sep-16 80.91 80.93 79.87 80.85 1873158
>> 23-Dec-11   - --40.56 0
>>
>> Because the prices are defined as Float, these rows cause the application
>> to crash
>> scala> val rs = df2.filter(changeToDate("TradeDate") >=
>> monthsago).select((changeToDate("TradeDate").as("TradeDate")
>> ),(('Close+'Open)/2).as("AverageDailyPrice"), 'Low.as("Day's Low"),
>> 'High.as("Day's High")).orderBy("TradeDate").collect
>> 16/09/27 21:48:53 ERROR Executor: Exception in task 0.0 in stage 61.0
>> (TID 260)
>> java.lang.NumberFormatException: For input string: "-"
>>
>>
>> One way is to define the prices as Strings but that is not
>> meaningful. Alternatively do the clean up before putting csv in HDFS but
>> that becomes tedious and error prone.
>>
>> Any ideas will be appreciated.
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>
>


Re: spark sql on json

2016-09-29 Thread Hyukjin Kwon
https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/sql/JavaSparkSQLExample.java#L104-L181

2016-09-29 18:58 GMT+09:00 Hitesh Goyal :

> Hi team,
>
>
>
> I have a json document. I want to put spark SQL to it.
>
> Can you please send me an example app built in JAVA so that I would be
> able to put spark sql queries on my data.
>
>
>
> Regards,
>
> *Hitesh Goyal*
>
> Simpli5d Technologies
>
> Cont No.: 9996588220
>
>
>


Re: pyspark: sqlContext.read.text() does not work with a list of paths

2016-10-06 Thread Hyukjin Kwon
It seems obviously a bug. It was introduced by my PR:
https://github.com/apache/spark/commit/d37c7f7f042f7943b5b684e53cf4284c601fb347

+1 for creating a JIRA and a PR. If you have any problem with doing this, I
would like to do it quickly myself.


On 5 Oct 2016 9:12 p.m., "Laurent Legrand"  wrote:

> Hello,
>
> When I try to load multiple text files with the sqlContext, I get the
> following error:
>
> spark-2.0.0-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/
> sql/readwriter.py",
> line 282, in text
> UnboundLocalError: local variable 'path' referenced before assignment
>
> According to the code
> (https://github.com/apache/spark/blob/master/python/pyspark/
> sql/readwriter.py#L291),
> the variable 'path' is not set if the argument is not a string.
>
> Could you confirm it is a bug?
>
> Regards,
> Laurent
>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/pyspark-sqlContext-read-text-does-not-
> work-with-a-list-of-paths-tp27838.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: Support for uniVocity in Spark 2.x

2016-10-06 Thread Hyukjin Kwon
Yeap, there was an option to switch between Apache Commons CSV and univocity
in the external CSV library, but univocity became the default and Apache
Commons CSV was removed while porting it into Spark 2.0.

On 7 Oct 2016 2:53 a.m., "Sean Owen"  wrote:

> It still uses univocity, but this is an implementation detail, so I don't
> think that has amounted to supporting or not supporting it.
>
> On Thu, Oct 6, 2016 at 4:00 PM Jean Georges Perrin  wrote:
>
>>
>> The original CSV parser from Databricks had support for uniVocity in
>> Spark 1.x. Can someone confirm it has disappeared (per:
>> https://spark.apache.org/docs/latest/api/java/org/
>> apache/spark/sql/streaming/DataStreamReader.html#csv(java.lang.String) )
>> in 2.0.x?
>>
>> Thanks!
>>
>> jg
>>
>>
>>
>>


Re: JSON Arrays and Spark

2016-10-10 Thread Hyukjin Kwon
FYI, it supports

[{...}, {...} ...]

or

{...}

as the format of each input record.

On 11 Oct 2016 3:19 a.m., "Jean Georges Perrin"  wrote:

> Thanks Luciano - I think this is my issue :(
>
> On Oct 10, 2016, at 2:08 PM, Luciano Resende  wrote:
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.
> html#json-datasets
>
> Particularly the note at the required format :
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
>
>
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin  wrote:
>
>> Hi folks,
>>
>> I am trying to parse JSON arrays and it’s getting a little crazy (for me
>> at least)…
>>
>> 1)
>> If my JSON is:
>> {"vals":[100,500,600,700,800,200,900,300]}
>>
>> I get:
>> ++
>> |vals|
>> ++
>> |[100, 500, 600, 7...|
>> ++
>>
>> root
>>  |-- vals: array (nullable = true)
>>  ||-- element: long (containsNull = true)
>>
>> and I am :)
>>
>> 2)
>> If my JSON is:
>> [100,500,600,700,800,200,900,300]
>>
>> I get:
>> ++
>> | _corrupt_record|
>> ++
>> |[100,500,600,700,...|
>> ++
>>
>> root
>>  |-- _corrupt_record: string (nullable = true)
>>
>> Both are legit JSON structures… Do you think that #2 is a bug?
>>
>> jg
>>
>>
>>
>>
>>
>>
>
>
> --
> Luciano Resende
> http://twitter.com/lresende1975
> http://lresende.blogspot.com/
>
>
>


Re: JSON Arrays and Spark

2016-10-11 Thread Hyukjin Kwon
No, I meant each record should be on a single line, but it supports an array
type too as a root wrapper of JSON objects.

If you need to parse multi-line JSON, I have a reference here:

http://searchdatascience.com/spark-adventures-1-processing-multi-line-json-files/
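For completeness, a minimal sketch of the approach in that link (the path is
hypothetical, and each input file is assumed to hold exactly one JSON
document):

val perFile = spark.sparkContext.wholeTextFiles("/path/to/json/dir")  // (path, whole file content)
val df = spark.read.json(perFile.map { case (_, txt) => txt })        // one document per record
df.printSchema()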

2016-10-12 15:04 GMT+09:00 Kappaganthu, Sivaram (ES) <
sivaram.kappagan...@adp.com>:

> Hi,
>
>
>
> Does this mean that handling any Json with kind of below schema  with
> spark is not a good fit?? I have requirement to parse the below Json that
> spans across multiple lines. Whats the best way to parse the jsns of this
> kind?? Please suggest.
>
>
>
> root
>
> |-- maindate: struct (nullable = true)
>
> ||-- mainidnId: string (nullable = true)
>
> |-- Entity: array (nullable = true)
>
> ||-- element: struct (containsNull = true)
>
> |||-- Profile: struct (nullable = true)
>
> ||||-- Kind: string (nullable = true)
>
> |||-- Identifier: string (nullable = true)
>
> |||-- Group: array (nullable = true)
>
> ||||-- element: struct (containsNull = true)
>
> |||||-- Period: struct (nullable = true)
>
> ||||||-- pid: string (nullable = true)
>
> ||||||-- pDate: string (nullable = true)
>
> ||||||-- quarter: long (nullable = true)
>
> ||||||-- labour: array (nullable = true)
>
> |||||||-- element: struct (containsNull = true)
>
> ||||||||-- category: string (nullable = true)
>
> ||||||||-- id: string (nullable = true)
>
> ||||||||-- person: struct (nullable = true)
>
> |||||||||-- address: array (nullable =
> true)
>
> ||||||||||-- element: struct
> (containsNull = true)
>
> |||||||||||-- city: string
> (nullable = true)
>
> |||||||||||-- line1: string
> (nullable = true)
>
> |||||||||||-- line2: string
> (nullable = true)
>
> |||||||||||-- postalCode: string
> (nullable = true)
>
> |||||||||||-- state: string
> (nullable = true)
>
> |||||||||||-- type: string
> (nullable = true)
>
> |||||||||-- familyName: string (nullable =
> true)
>
> ||||||||-- tax: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- code: string (nullable =
> true)
>
> ||||||||||-- qwage: double (nullable =
> true)
>
> ||||||||||-- qvalue: double (nullable
> = true)
>
> ||||||||||-- qSubjectvalue: double
> (nullable = true)
>
> ||||||||||-- qfinalvalue: double
> (nullable = true)
>
> ||||||||||-- ywage: double (nullable =
> true)
>
> ||||||||||-- yalue: double (nullable =
> true)
>
> ||||||||||-- ySubjectvalue: double
> (nullable = true)
>
> ||||||||||-- yfinalvalue: double
> (nullable = true)
>
> ||||||||-- tProfile: array (nullable = true)
>
> |||||||||-- element: struct (containsNull
> = true)
>
> ||||||||||-- isExempt: boolean
> (nullable = true)
>
> ||||||||||-- jurisdiction: struct
> (nullable = true)
>
> |||||||||||-- code: string
> (nullable = true)
>
> ||||||||||-- maritalStatus: string
> (nullable = true)
>
> ||||||||||-- numberOfDeductions: long
> (nullable = true)
>
> ||||||||-- wDate: struct (nullable = true)
>
> |||||||||-- originalHireDate: string
> (nullable = true)
>
> ||||||-- year: long (nullable = true)
>
>
>
>
>
> *From:* Luciano Resende [mailto:luckbr1...@gmail.com]
> *Sent:* Monday, October 10, 2016 11:39 PM
> *To:* Jean Georges Perrin
> *Cc:* user @spark
> *Subject:* Re: JSON Arrays and Spark
>
>
>
> Please take a look at
> http://spark.apache.org/docs/latest/sql-programming-guide.
> html#json-datasets
>
> Particularly the note at the required format :
>
> Note that the file that is offered as *a json file* is not a typical JSON
> file. Each line must contain a separate, self-contained valid JSON object.
> As a consequence, a regular multi-line JSON file will most often fail.
>
>
>
> On Mon, Oct 10, 2016 at 9:57 AM, Jean Georges Perrin  wrote:

Re: Why the json file used by sparkSession.read.json must be a valid json object per line

2016-10-18 Thread Hyukjin Kwon
Regarding his recent PR [1], I guess he meant multi-line JSON.

As far as I know, single-line JSON also complies with the standard. I left a
comment with the RFC in the PR, but please let me know if I am wrong at any
point.

Thanks!

[1]https://github.com/apache/spark/pull/15511

On 19 Oct 2016 7:00 a.m., "Daniel Barclay" 
wrote:

> Koert,
>
> Koert Kuipers wrote:
>
> A single json object would mean for most parsers it needs to fit in memory
> when reading or writing
>
> Note that codlife didn't seem to being asking about *single-object* JSON
> files, but about *standard-format* JSON files.
>
>
> On Oct 15, 2016 11:09, "codlife" <1004910...@qq.com> wrote:
>
>> Hi:
>>I'm doubt about the design of spark.read.json,  why the json file is
>> not
>> a standard json file, who can tell me the internal reason. Any advice is
>> appreciated.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/Why-the-json-file-used-by-sparkSession
>> -read-json-must-be-a-valid-json-object-per-line-tp27907.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>


Re: how to extract arraytype data to file

2016-10-18 Thread Hyukjin Kwon
This reminds me of
https://github.com/databricks/spark-xml/issues/141#issuecomment-234835577

Maybe using explode() would be helpful.

Thanks!
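For example, a small sketch with the schema you showed (assuming the loaded
dataframe is called df):

import org.apache.spark.sql.functions.{col, explode}

// one output row per element of the bizs array, then pull the struct fields out
val flattened = df
  .select(explode(col("bizs")).as("biz"))
  .select(col("biz.bid").as("bid"), col("biz.code").as("code"))

flattened.show()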

2016-10-19 14:05 GMT+09:00 Divya Gehlot :

> http://stackoverflow.com/questions/33864389/how-can-i-
> create-a-spark-dataframe-from-a-nested-array-of-struct-element
>
> Hope this helps
>
>
> Thanks,
> Divya
>
> On 19 October 2016 at 11:35, lk_spark  wrote:
>
>> hi,all:
>> I want to read a json file and search it by sql .
>> the data struct should be :
>>
>> bid: string (nullable = true)
>> code: string (nullable = true)
>>
>> and the json file data should be like :
>>  {bid":"MzI4MTI5MzcyNw==","code":"罗甸网警"}
>>  {"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}
>> but in fact my json file data is :
>> {"bizs":[ {bid":"MzI4MTI5MzcyNw==","code
>> ":"罗甸网警"},{"bid":"MzI3MzQ5Nzc2Nw==","code":"西早君"}]}
>> {"bizs":[ {bid":"MzI4MTI5Mzcy00==","code
>> ":"罗甸网警"},{"bid":"MzI3MzQ5Nzc201==","code":"西早君"}]}
>> I load it by spark ,data schema shows like this :
>>
>> root
>>  |-- bizs: array (nullable = true)
>>  ||-- element: struct (containsNull = true)
>>  |||-- bid: string (nullable = true)
>>  |||-- code: string (nullable = true)
>>
>>
>> I can select columns by : df.select("bizs.id","bizs.name")
>> but the colume values is in array type:
>> +++
>> |  id|code|
>> +++
>> |[4938200, 4938201...|[罗甸网警, 室内设计师杨焰红, ...|
>> |[4938300, 4938301...|[SDCS十全九美, 旅梦长大, ...|
>> |[4938400, 4938401...|[日重重工液压行走回转, 氧老家,...|
>> |[4938500, 4938501...|[PABXSLZ, 陈少燕, 笑蜜...|
>> |[4938600, 4938601...|[税海微云, 西域美农云家店, 福...|
>> +++
>>
>> what I want is I can read colum in normal row type. how I can do it ?
>> 2016-10-19
>> --
>> lk_spark
>>
>
>


Re: pyspark doesn't recognize MMM dateFormat pattern in spark.read.load() for dates like 1989Dec31 and 31Dec1989

2016-10-24 Thread Hyukjin Kwon
I am also interested in this issue. I will try to look into this too within
the coming few days.
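For reference, the Java-side behaviour described below can be reproduced
without Spark like this (a sketch; the pattern is derived from "1989Dec31",
and the locale-dependent failure applies when the default locale's month
names do not include "Dec", e.g. Italian):

import java.text.SimpleDateFormat
import java.util.Locale

new SimpleDateFormat("yyyyMMMdd", Locale.US).parse("1989Dec31")  // works everywhere
new SimpleDateFormat("yyyyMMMdd").parse("1989Dec31")             // uses the JVM default locale;
                                                                 // may throw ParseException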

2016-10-24 21:32 GMT+09:00 Sean Owen :

> I actually think this is a general problem with usage of DateFormat and
> SimpleDateFormat across the code, in that it relies on the default locale
> of the JVM. I believe this needs to, at least, default consistently to
> Locale.US so that behavior is consistent; otherwise it's possible that
> parsing and formatting of dates could work subtly differently across
> environments.
>
> There's a similar question about some code that formats dates for the UI.
> It's more reasonable to let that use the platform-default locale, but, I'd
> still favor standardizing it I think.
>
> Anyway, let me test it out a bit and possibly open a JIRA with this change
> for discussion.
>
> On Mon, Oct 24, 2016 at 1:03 PM pietrop  wrote:
>
> Hi there,
> I opened a question on StackOverflow at this link:
> http://stackoverflow.com/questions/40007972/pyspark-doesnt-recognize-mmm-
> dateformat-pattern-in-spark-read-load-for-dates?
> noredirect=1#comment67297930_40007972
>
> I didn’t get any useful answer, so I’m writing here hoping that someone can
> help me.
>
> In short, I’m trying to read a CSV containing data columns stored using the
> pattern “MMMdd”. What doesn’t work for me is “MMM”. I’ve done some
> testing and discovered that it’s a localization issue. As you can read from
> the StackOverflow question, I run a simple Java code to parse the date
> “1989Dec31” and it works only if I specify Locale.US in the
> SimpleDateFormat() function.
>
> I would like pyspark to work. I tried setting a different local from
> console
> (LANG=“en_US”), but it doesn’t work. I tried also setting it using the
> locale package from Python.
>
> So, there’s a way to set locale in Spark when using pyspark? The issue is
> Java related and not Python related (the function that parses data is
> invoked by spark.read.load(dateFormat=“MMMdd”, …). I don’t want to use
> other solutions in order to encode data because they are slower (from what
> I’ve seen so far).
>
> Thank you
> Pietro
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/pyspark-doesn-t-recognize-MMM-
> dateFormat-pattern-in-spark-read-load-for-dates-like-
> 1989Dec31-and-31D9-tp27951.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
There are now timestampFormat for TimestampType and dateFormat for DateType.

Do you mind sharing your code?

On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:

> is there a reason a column with dates in format -mm-dd in a csv file
> is inferred to be TimestampType and not DateType?
>
> thanks! koert
>


Re: spark infers date to be timestamp type

2016-10-26 Thread Hyukjin Kwon
Hi Koert,


Sorry, I thought you meant this was a regression between 2.0.0 and 2.0.1. I
just checked, and it has never supported inferring DateType [1].

Yes, it currently only supports inferring such data as timestamps.


[1]
https://github.com/apache/spark/blob/branch-2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchema.scala#L85-L92
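For what it's worth, here is a sketch of the two workarounds Anand mentions
below (explicit schema, or casting after reading), using the test.csv from
your example:

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.col

// 1) give the schema explicitly so the column is read as DateType directly
val withSchema = spark.read
  .option("header", true)
  .option("dateFormat", "yyyy-MM-dd")
  .schema(StructType(Seq(StructField("date", DateType))))
  .csv("test.csv")

// 2) or keep inference and cast the inferred timestamp column afterwards
val casted = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("test.csv")
  .withColumn("date", col("date").cast(DateType))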




2016-10-27 9:12 GMT+09:00 Anand Viswanathan :

> Hi,
>
> you can use the customSchema(for DateType) and specify dateFormat in
> .option().
> or
> at spark dataframe side, you can convert the timestamp to date using cast
> to the column.
>
> Thanks and regards,
> Anand Viswanathan
>
> On Oct 26, 2016, at 8:07 PM, Koert Kuipers  wrote:
>
> hey,
> i create a file called test.csv with contents:
> date
> 2015-01-01
> 2016-03-05
>
> next i run this code in spark 2.0.1:
> spark.read
>   .format("csv")
>   .option("header", true)
>   .option("inferSchema", true)
>   .load("test.csv")
>   .printSchema
>
> the result is:
> root
>  |-- date: timestamp (nullable = true)
>
>
> On Wed, Oct 26, 2016 at 7:35 PM, Hyukjin Kwon  wrote:
>
>> There are now timestampFormat for TimestampType and dateFormat for
>> DateType.
>>
>> Do you mind if I ask to share your codes?
>>
>> On 27 Oct 2016 2:16 a.m., "Koert Kuipers"  wrote:
>>
>>> is there a reason a column with dates in format -mm-dd in a csv file
>>> is inferred to be TimestampType and not DateType?
>>>
>>> thanks! koert
>>>
>>
>
>


Re: csv date/timestamp type inference in spark 2.0.1

2016-10-26 Thread Hyukjin Kwon
Hi Koert,


I am curious about your case. I guess the purpose of timestampFormat and
dateFormat is to control how timestamps/dates are parsed/inferred, not to
exclude them from type inference/parsing. Actually, it does try to
infer/parse in 2.0.0 as well (but it fails there), so I guess there wouldn't
be a big performance difference.

Since this is type inference, I guess it is the right behaviour that it tries
its best to infer the appropriate type inclusively.

Why don't you just cast the timestamps to strings?
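If the goal is just to keep such columns as plain strings, a sketch (with an
explicit schema, no inference is attempted for that column at all; the file
name and single-column layout are hypothetical):

import org.apache.spark.sql.types._

val schema = StructType(Seq(StructField("date", StringType)))
val df = spark.read.option("header", true).schema(schema).csv("dates.csv")
// df.schema: date is StringType; nothing is inferred as timestamp or date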


Thanks.


2016-10-27 9:47 GMT+09:00 Koert Kuipers :

> i tried setting both dateFormat and timestampFormat to impossible values
> (e.g. "~|.G~z~a|wW") and it still detected my data to be TimestampType
>
> On Wed, Oct 26, 2016 at 1:15 PM, Koert Kuipers  wrote:
>
>> we had the inference of dates/timestamps when reading csv files disabled
>> in spark 2.0.0 by always setting dateFormat to something impossible (e.g.
>> dateFormat "~|.G~z~a|wW")
>>
>> i noticed in spark 2.0.1 that setting this impossible dateFormat does not
>> stop spark from inferring it is a date or timestamp type anyhow. is this
>> intentional? how do i disable inference of datetype/timestamp type now?
>>
>> thanks! koert
>>
>>
>


Re: Spark XML ignore namespaces

2016-11-03 Thread Hyukjin Kwon
Oh, that PR was actually about not handling the namespaces (meaning leaving
the data as it is, including prefixes).

The problem is that each partition needs to produce each record knowing the
namespaces.

It is fine to deal with them if they are declared within each XML document
(represented as a row in the dataframe), but it becomes problematic if they
are declared in the parent of each XML document (represented as a row in the
dataframe).

There is an issue open for this:
https://github.com/databricks/spark-xml/issues/74

It'd be nicer to have an option to enable/disable this once we can properly
support namespace handling.

We might be able to talk more there.



2016-11-04 6:37 GMT+09:00 Arun Patel :

> I see that 'ignoring namespaces' issue is resolved.
>
> https://github.com/databricks/spark-xml/pull/75
>
> How do we enable this option and ignore namespace prefixes?
>
> - Arun
>


Re: Error creating SparkSession, in IntelliJ

2016-11-03 Thread Hyukjin Kwon
Hi Shyla,

there is documentation for setting up an IDE here:
https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup

I hope this is helpful.


2016-11-04 9:10 GMT+09:00 shyla deshpande :

> Hello Everyone,
>
> I just installed Spark 2.0.1, spark shell works fine.
>
> Was able to run some simple programs from the Spark Shell, but find it
> hard to make the same program work when using IntelliJ.
>  I am getting the following error.
>
> Exception in thread "main" java.lang.NoSuchMethodError:
> scala.Predef$.$scope()Lscala/xml/TopScope$;
> at org.apache.spark.ui.jobs.StagePage.(StagePage.scala:44)
> at org.apache.spark.ui.jobs.StagesTab.(StagesTab.scala:34)
> at org.apache.spark.ui.SparkUI.(SparkUI.scala:62)
> at org.apache.spark.ui.SparkUI$.create(SparkUI.scala:215)
> at org.apache.spark.ui.SparkUI$.createLiveUI(SparkUI.scala:157)
> at org.apache.spark.SparkContext.(SparkContext.scala:440)
> at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2275)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:831)
> at org.apache.spark.sql.SparkSession$Builder$$anonfun$
> 8.apply(SparkSession.scala:823)
> at scala.Option.getOrElse(Option.scala:121)
> at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.
> scala:823)
> at SparkSessionTest.DataSetWordCount$.main(DataSetWordCount.scala:15)
> at SparkSessionTest.DataSetWordCount.main(DataSetWordCount.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
>
> Thanks
> -Shyla
>
>
>


Re: Reading csv files with quoted fields containing embedded commas

2016-11-06 Thread Hyukjin Kwon
Hi Femi,

Have you maybe tried the quote-related options described in the
documentation?

http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.csv

Thanks.
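For example, a sketch in Scala (the option names are identical in PySpark);
the assumption is that the embedded quotes in your data are escaped by
doubling them, as in "32 YIU ""A""", so the escape character should be set to
the quote character rather than the default backslash. csvPath stands in for
your csv_path:

val dfRaw = spark.read
  .option("header", "true")
  .option("quote", "\"")    // the default quote character
  .option("escape", "\"")   // treat "" inside a quoted field as a literal quote
  .csv(csvPath)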

2016-11-06 6:58 GMT+09:00 Femi Anthony :

> Hi, I am trying to process a very large comma delimited csv file and I am
> running into problems.
> The main problem is that some fields contain quoted strings with embedded
> commas.
> It seems as if PySpark is unable to properly parse lines containing such
> fields like say Pandas does.
>
> Here is the code I am using to read the file in Pyspark
>
> df_raw=spark.read.option("header","true").csv(csv_path)
>
> Here is an example of a good and 'bad' line in such a file:
>
>
> col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,
> col12,col13,col14,col15,col16,col17,col18,col19
> 80015360210876000,11.22,X,4076710258,,,sxsw,,"32 YIU ""A""",S5,,"32 XIY
> ""W""   JK, RE LK",SOMETHINGLIKEAPHENOMENON#YOUGOTSOUL~BRINGDANOISE,23.0,
> cyclingstats,2012-25-19,432,2023-05-17,CODERED
> 6167229561918,137.12,U,8234971771,,,woodstock,,,T4,,,
> OUTKAST#THROOTS~WUTANG#RUNDMC,0.0,runstats,2013-21-22,1333,
> 2019-11-23,CODEBLUE
>
> Line 0 is the header
> Line 1 is the 'problematic' line
> Line 2 is a good line.
>
> Pandas can handle this easily:
>
>
> [1]: import pandas as pd
>
> In [2]: pdf = pd.read_csv('malformed_data.csv')
>
> In [4]: pdf[['col12','col13','col14']]
> Out[4]:
> col12
> col13  \
> 0  32 XIY "W"   JK, RE LK  SOMETHINGLIKEAPHENOMENON#
> YOUGOTSOUL~BRINGDANOISE
> 1 NaN OUTKAST#THROOTS~WUTANG#RUNDMC
>
>col14
> 0   23.0
> 10.0
>
>
> while Pyspark seems to parse this erroneously:
>
> [5]: sdf=spark.read.format("org.apache.spark.csv").csv('
> malformed_data.csv',header=True)
>
> [6]: sdf.select("col12","col13",'col14').show()
> +--+++
> | col12|   col13|   col14|
> +--+++
> |"32 XIY ""W""   JK|  RE LK"|SOMETHINGLIKEAPHE...|
> |  null|OUTKAST#THROOTS~W...| 0.0|
> +--+++
>
>  Is this a bug or am I doing something wrong ?
>  I am working with Spark 2.0
>  Any help is appreciated
>
> Thanks,
> -- Femi
>
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>

