This should be pretty straightforward.

You can create a tab-separated file from any database table by bulk copying
(bcp) it out: MSSQL, Sybase etc.

 bcp scratchpad..nw_10124772 out nw_10124772.tsv -c -t '\t' -Usa -A16384
Password:
Starting copy...
441 rows copied.

more nw_10124772.tsv
Mar 22 2011 12:00:00:000AM      SBT     602424  10124772        FUNDS TRANSFER , FROM A/C 17904064      200.00          200.00
Mar 22 2011 12:00:00:000AM      SBT     602424  10124772        FUNDS TRANSFER , FROM A/C 36226823      454.74          654.74

Put that file into HDFS; note that it has no header row.
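
A minimal sketch of the copy step (assuming the file sits in the current
directory and the /tmp target used in the read below):

 hdfs dfs -put nw_10124772.tsv hdfs://rhes564:9000/tmp/nw_10124772.tsv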

Read it in as a TSV file:

scala> val df2 = spark.read.option("header", true).option("delimiter", "\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")
df2: org.apache.spark.sql.DataFrame = [Mar 22 2011 12:00:00:000AM: string, SBT: string ... 6 more fields]

scala> df2.first
res7: org.apache.spark.sql.Row = [Mar 22 2011 12:00:00:000AM,SBT,602424,10124772,FUNDS TRANSFER , FROM A/C 17904064,200.00,,200.00]
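
One caveat: because the read above used header=true while the file has no
header row, Spark took the first data row as the column names (hence the
"Mar 22 2011 12:00:00:000AM" column in the DataFrame above). A minimal
sketch with header=false instead, where Spark assigns the default names
_c0, _c1 and so on (df3 is just an illustrative name):

val df3 = spark.read.option("header", false).option("delimiter", "\t").csv("hdfs://rhes564:9000/tmp/nw_10124772.tsv")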

HTH

Dr Mich Talebzadeh



LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 September 2016 at 13:57, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Thanks Jacek.
>
> The old stuff with databricks
>
> scala> val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
> df: org.apache.spark.sql.DataFrame = [Transaction Date: string, Transaction Type: string ... 7 more fields]
>
> Now I can do
>
> scala> val df2 = spark.read.option("header", true).csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")
> df2: org.apache.spark.sql.DataFrame = [Transaction Date: string, Transaction Type: string ... 7 more fields]
>
> As for the schema, which Spark apparently works out itself:
>
> scala> df.printSchema
> root
>  |-- Transaction Date: string (nullable = true)
>  |-- Transaction Type: string (nullable = true)
>  |-- Sort Code: string (nullable = true)
>  |-- Account Number: integer (nullable = true)
>  |-- Transaction Description: string (nullable = true)
>  |-- Debit Amount: double (nullable = true)
>  |-- Credit Amount: double (nullable = true)
>  |-- Balance: double (nullable = true)
>  |-- _c8: string (nullable = true)
>
> scala> df2.printSchema
> root
>  |-- Transaction Date: string (nullable = true)
>  |-- Transaction Type: string (nullable = true)
>  |-- Sort Code: string (nullable = true)
>  |-- Account Number: string (nullable = true)
>  |-- Transaction Description: string (nullable = true)
>  |-- Debit Amount: string (nullable = true)
>  |-- Credit Amount: string (nullable = true)
>  |-- Balance: string (nullable = true)
>  |-- _c8: string (nullable = true)
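>
> Note the difference: df was read with inferSchema set to "true", so Spark
> sampled the data and assigned integer/double types, whereas df2 was read
> without that option, so every column defaults to string. A minimal sketch
> (untested) combining the native csv() reader with inference, using the
> same option key as df above; df3 is just an illustrative name:
>
> val df3 = spark.read.option("header", true).option("inferSchema", true).csv("hdfs://rhes564:9000/data/stg/accounts/ll/18740868")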
>
> Cheers
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 September 2016 at 13:12, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi Mich,
>>
>> CSV is now one of the 7 formats natively supported by Spark SQL in 2.0.
>> No need to use "com.databricks.spark.csv" and --packages. A mere
>> format("csv") or csv(path: String) would do it. The options are the same.
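>>
>> For instance, these two reads should be equivalent for a TSV file (a
>> sketch with an illustrative path, not taken from the sessions above):
>>
>> val a = spark.read.option("delimiter", "\t").csv("/tmp/file.tsv")
>> val b = spark.read.format("csv").option("delimiter", "\t").load("/tmp/file.tsv")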
>>
>> p.s. Yup, when I read TSV I thought about time series data that I
>> believe got its own file format and support @ spark-packages.
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://medium.com/@jaceklaskowski/
>> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> Follow me at https://twitter.com/jaceklaskowski
>>
>>
>> On Sat, Sep 10, 2016 at 8:00 AM, Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>> > I gather the title should say CSV as opposed to TSV?
>> >
>> > Also, when the term spark-csv is used, is it a reference to the
>> > Databricks package?
>> >
>> > val df = spark.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load......
>> >
>> > Or is it something new in 2.0, like spark-sql etc.?
>> >
>> > Thanks
>> >
>> > Dr Mich Talebzadeh
>> >
>> >
>> >
>> > LinkedIn
>> > https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >
>> >
>> >
>> > http://talebzadehmich.wordpress.com
>> >
>> >
>> > Disclaimer: Use it at your own risk. Any and all responsibility for any
>> > loss, damage or destruction of data or any other property which may arise
>> > from relying on this email's technical content is explicitly disclaimed.
>> > The author will in no case be liable for any monetary damages arising
>> > from such loss, damage or destruction.
>> >
>> >
>> >
>> >
>> > On 10 September 2016 at 12:37, Jacek Laskowski <ja...@japila.pl> wrote:
>> >>
>> >> Hi,
>> >>
>> >> If Spark 2.0 supports a format, use it. For CSV it's csv() or
>> >> format("csv"), and it should work from both Scala and Java. If the API
>> >> is broken for Java (but works for Scala), you'd have to create a
>> >> "bridge" yourself or report an issue in Spark's JIRA at
>> >> https://issues.apache.org/jira/browse/SPARK.
>> >>
>> >> Have you run into any issues with CSV and Java? Share the code.
>> >>
>> >> Pozdrawiam,
>> >> Jacek Laskowski
>> >> ----
>> >> https://medium.com/@jaceklaskowski/
>> >> Mastering Apache Spark 2.0 http://bit.ly/mastering-apache-spark
>> >> Follow me at https://twitter.com/jaceklaskowski
>> >>
>> >>
>> >> On Sat, Sep 10, 2016 at 7:30 AM, Muhammad Asif Abbasi
>> >> <asif.abb...@gmail.com> wrote:
>> >> > Hi,
>> >> >
>> >> > I would like to know the most efficient way of reading TSV files in
>> >> > Scala, Python and Java with Spark 2.0.
>> >> >
>> >> > I believe that in Spark 2.0 CSV is a native source based on the
>> >> > spark-csv module, and we can potentially read a "tsv" file by
>> >> > specifying
>> >> >
>> >> > 1. the option("delimiter", "\t") setting in Scala
>> >> > 2. the sep argument in Python.
>> >> >
>> >> > However, I am unsure of the best way to achieve this in Java.
>> >> > Furthermore, are the above the optimal ways to read a TSV file?
>> >> >
>> >> > Appreciate a response on this.
>> >> >
>> >> > Regards.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>> >>
>> >
>>
>
>
